FINAL  REPORT 


Feasibility  Study  of  Alternative  Ways 

for  Providing  Basic  Skills  Tests  to 

Massachusetts  School  Systems  as 

Required  under  the  Massachusetts 

Basic  Skills  Improvement  Policy 


Prepared  by: 

Richard  K.  Hill 
Richard  Lyczak 

RMC  Research  Corporation 



for: 

Massachusetts  Department  of  Education 

Bureau  of  Research  and  Assessment 

31  St.  James  Avenue 

Boston,  Massachusetts   02116 


June,  1980 


MASSACHUSETTS  BOARD  OF  EDUCATION 


Ann  H.  McHugh,  Chairperson 
James L. Green, Vice Chairperson


John  Anthony 
John  W.  Bond 
Millie  Clements 
Howard  A.  Greis 
Charles  T.  Grigsby 
Mary  Ann  Hardenbergh 
Armando  Martinez 
Joseph  C.  Mello,  Jr. 
Edwin  M.  Rossman 
Donald  R.  Walker 


Gregory  R.  Anrig,  Commissioner  of  Education,  Secretary 

Ex-Officio Member: Laura Clausen, Chancellor, Board of Higher Education


Prepared  by  the  Bureau  of  Research  and  Assessment 


Allan  S.  Hartman,  Director 

Tracy  Libros 

Leslie  S.  May 

Matthew  H.  Towle 

Dorothy  Earle 

Betty  Hancock 

Cathie  Ridge 


Produced  by  the  Bureau  of  Educational  Information  Services 


Richard  A.  Gilman,  Director 

Cynthia  Nadreau,  Production  Coordinator 

The Massachusetts Department of Education insures equal employment/educational
opportunities/affirmative action, regardless of race, color, creed, national
origin, or sex, in compliance with Title IX, or handicap, in compliance with
Section 504.


Publication  #  11913  approved  by  Alfred  C.  Holland,  State  Purchasing  Agent 


FINAL  REPORT 


Feasibility  Study  of  Alternative  Ways 

for  Providing  Basic  Skills  Tests  to 

Massachusetts  School  Systems  as 

Required  under  the  Massachusetts 

Basic  Skills  Improvement  Policy 


Prepared  by: 

Richard  K.  Hill 
Richard  Lyczak 

RMC  Research  Corporation 

111  Bow  Street 
Portsmouth,  New  Hampshire 


for: 

Massachusetts  Department  of  Education 

Bureau  of  Research  and  Assessment 

31  St.  James  Avenue 

Boston,  Massachusetts 


June,  1980 


ACKNOWLEDGEMENTS


RMC  and  the  Bureau  of  Research  and  Assessment  wish  to  express  their 
gratitude  to  the  following  members  of  the  Review  Committee  for  the 
assistance  they  provided  during  the  course  of  this  contract.   By  offering 
guidance  and  constructive  criticism,  they  greatly  influenced  the  direction 
of this study, as well as the content of interviews, workshops, and reports.


Pamela Almeida      Reading Public Schools
Fred Andelman       Massachusetts Teachers Association
Gary Baker          Acton Public Schools
Robert Coelho       Attleboro Public Schools
Joseph Crick        University of Massachusetts-Boston
Parker Damon        Acton Public Schools
Jid Kamitian        Holliston Public Schools
Ed Reidy            Fitchburg Public Schools
Charlotte Ryan      Parent Student Teacher Association


We  are  also  indebted  to  the  following  Regional  Linkers  for  setting 
up  the  regional  workshops  and  providing  feedback  on  workshop  materials. 


Winifred Green      Pittsfield Regional Center
Sydney Pierce       Springfield Regional Center
John Schomer        Central Massachusetts Regional Center
Paul Francis        Southeast Regional Center
Carol Thomson       Boston Regional Center
Peter Coffin        Northeast Regional Center


Executive  Summary 


Beginning  in  October,  1979,  RMC  Research  Corporation  undertook  a 
study  to  determine  the  feasibility  of  various  methods  by  which  the 
Department  of  Education  might  fulfill  its  obligation  to  provide  local 
districts  with  basic  skills  tests.   As  a  result  of  that  study  RMC  is 
making  three  major  recommendations  to  the  Department  of  Education. 

I.   The  Department  should  develop  a  pool  of  approximately 
1,200  items  in  reading  and  mathematics  and  construct 
an  additional  12  items  in  writing. 

II.   Each  year  the  Department  should  develop  two  equivalent 
forms  of  a  test  from  the  item  pool,  and  provide  camera- 
ready  copy  of  those  forms  to  participating  school  districts . 

III.   The  Department  should  release  the  entire  item  pool  to 

participating  districts  and  permit  them,  within  certain 
guidelines,  to  revise  the  State's  equivalent  forms  or 
develop  their  own  test. 

The above recommendations were the result of a two-stage information-gathering
process. RMC began by generating descriptions of ten potential
systems for providing school districts with tests. Differences among
these systems tended to revolve around six key issues:

(1) Whether the test should be long or short

(2)  Whether  multiple  forms  were  necessary 

(3)  Whether  an  item  pool  should  be  released 

(4)  Whether  the  test  delivery  system  should  be  computerized 

(5)  Whether  items  should  be  written  on  request 

(6)  Whether  norms  should  be  provided 

In  order  to  narrow  the  number  of  alternatives  under  consideration,  RMC 
discussed  each  of  these  issues  in  depth  with  teachers  and  administrators 
in  a  stratified  random  sample  of  eight  school  districts  around  the  state. 
In  addition,  the  feasibility  of  various  options  was  discussed  with  a 
variety  of  professionals  in  the  field  of  testing  and  measurement.   By  the 
end  of  this  initial  phase  of  the  study  it  was  clear  that  the  system  adopted 
would  have  to: 

(1)  be  available  to  districts  at  no  extra  cost. 

(2)  be  maintainable  by  existing  staff  in  the  Bureau  of  Research 
and  Assessment. 

(3)  provide  districts  with  two  forms  of  a  test  each  year. 

(4)  be  usable  by  districts  with  no  additional  effort. 

(5) be understandable to district personnel and the public
at large.


In  addition  it  was  clear  that  the  system  should: 

(1)  provide  short  tests  rather  than  long. 

(2) be flexible enough to accommodate future changes in the Basic
Skills Improvement Policy.

(3)  be  flexible  enough  for  districts  to  use  the  tests  for  a 
variety  of  purposes. 

The  system  which  came  closest  to  meeting  these  requirements  is  the  system 
currently  in  use  by  the  Department  of  Education.   By  incorporating  desirable 
features  of  other  systems,  however,  it  was  felt  that  the  current  system 
could  be  made  more  flexible,  cost  effective,  and  responsive  to  concerns 
expressed  by  districts  in  the  interview  sample.   In  its  initial  report, 
therefore,  RMC  concluded  that  the  Department  of  Education  should  continue 
providing  two  equivalent  forms  of  a  test  to  participating  districts  each 
year,  much  as  it  is  doing  now.   Instead  of  assembling  these  tests  from 
scratch  each  year,  however,  it  was  recommended  that  an  item  pool  be  created 
from  which  subsequent  forms  could  be  developed.   Unresolved  in  the  initial 
report  was  the  issue  of  whether  the  item  pool  should  be  released  for  local 
district  use,  as  well  as  being  maintained  for  Department  use,  and  what  type 
of  item  statistics  would  best  facilitate  use  of  the  pool. 

In  its  second  phase  of  data  gathering,  RMC  conducted  six  four-hour 
workshops  around  the  state  in  cooperation  with  the  six  regional  education 
centers.   These  workshops  reviewed  the  two  remaining  issues  with  participants 
and  provided  hands-on  experience  in  using  a  simulated  item  pool  to  answer 
a  series  of  specific  test  construction  and  test  interpretation  questions. 
In a questionnaire administered at the end of the workshop, 96% of the
participants favored release of the item pool. Among the alternative item
statistics discussed, p-values (the proportion of students answering an item
correctly) were the most popular. In response to this feedback, RMC concluded
that the item pool should be released with p-values accompanying each item.
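
As an illustration of the p-value statistic defined above, the following is a minimal sketch of how it could be computed from scored responses. The item labels and response data are hypothetical and are not drawn from the study; the function name is likewise an assumption made for this example.

```python
from typing import Dict, List


def item_p_values(responses: Dict[str, List[int]]) -> Dict[str, float]:
    """Return, for each item, the proportion of students answering correctly.

    `responses` maps an item label to a list of 0/1 scores, one per student
    (1 = correct, 0 = incorrect).
    """
    p_values = {}
    for item, scores in responses.items():
        if not scores:
            raise ValueError(f"no responses recorded for item {item!r}")
        p_values[item] = sum(scores) / len(scores)
    return p_values


# Hypothetical scored responses for five students on three items.
sample = {
    "R-01": [1, 1, 1, 0, 1],
    "R-02": [1, 0, 0, 0, 1],
    "M-01": [1, 1, 1, 1, 1],
}
p = item_p_values(sample)
```

An item with a p-value near 1.0 was answered correctly by nearly all students in the sample, while a low p-value marks a relatively difficult item.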

In order to implement this system, RMC has recommended that the Department
of Education purchase an existing item pool or portions of existing
item  pools  which  match  the  State's  basic  skills  objectives.   Items  should  be 
screened  by  a  review  committee.   Items  selected  for  inclusion  in  the  pool 
should  be  pilot  tested  along  with  items  from  the  current  basic  skills  tests 
to  obtain  estimated  ninth  and  twelfth  grade  norms.   Because  norms  would  be 
estimated  via  a  linking  process,  on  a  sample  of  convenience,  districts  would 
know  the  relative  difficulties  of  items  in  the  pool,  but  would  be  unable  to 
make  any  meaningful  comparisons  between  their  district  and  the  State  as  a 
whole.  While  the  procedure  of  estimating  norms  was  actually  recommended  for 
fiscal  reasons,  it  also  avoids  the  politically  sensitive  issue  of  statewide 
comparisons.   Although  statewide  comparisons  can  be  made  under  the  current 
system,  23%  of  the  workshop  participants  and  two  out  of  the  eight  districts 
interviewed  were  opposed  to  such  comparisons. 

Once  the  item  pool  has  been  scaled,  it  should  be  printed  and  distributed 
to  local  districts  which  attend  a  workshop  on  its  use.   As  part  of  that 
workshop,  districts  should  be  introduced  to  a  set  of  test  construction 
procedures  designed  to  insure  that  the  tests  they  prepare  are  equivalent  to 
the  two  forms  released  by  the  Department  each  year. 
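
The test construction procedures themselves are not spelled out in this summary. Purely as an illustration of the kind of check such a procedure might include, the sketch below compares the mean p-value of a locally assembled form with that of a State form. The tolerance, the item labels, the p-values, and the rule itself are all assumptions made for this example, not the procedure RMC recommended.

```python
from typing import Dict, List


def mean_p_value(form_items: List[str], pool: Dict[str, float]) -> float:
    """Average p-value of the items on a form, looked up in the item pool."""
    return sum(pool[item] for item in form_items) / len(form_items)


def forms_comparable(local_form: List[str], state_form: List[str],
                     pool: Dict[str, float], tolerance: float = 0.02) -> bool:
    """Assumed rule: same length and mean p-values within `tolerance`."""
    if len(local_form) != len(state_form):
        return False
    gap = abs(mean_p_value(local_form, pool) - mean_p_value(state_form, pool))
    return gap <= tolerance


# Hypothetical pool of items with estimated p-values.
pool = {"A": 0.9, "B": 0.7, "C": 0.5, "D": 0.8, "E": 0.7, "F": 0.6, "G": 0.9}

state_form = ["A", "B", "C"]      # mean p-value of about 0.70
balanced_local = ["D", "E", "F"]  # mean p-value of about 0.70
easy_local = ["A", "G", "D"]      # mean p-value of about 0.87

balanced_ok = forms_comparable(balanced_local, state_form, pool)
easy_ok = forms_comparable(easy_local, state_form, pool)
```

A fuller procedure would presumably also match forms objective by objective; the sketch checks overall difficulty only.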


RMC  feels  that  implementation  of  the  above  recommendations  will  allow 
districts  maximum  flexibility  in  assessing  basic  skills  while  guaranteeing 
the  use  of  high  quality  assessment  instruments.   We  are  convinced  that  the 
proposed  system  fulfills  the  state's  obligation  to  provide  basic  skills 
tests  in  a  way  which  will  be  both  cost  effective  and  acceptable  to  the 
greatest  number  of  school  districts  in  Massachusetts. 


TABLE OF CONTENTS


Acknowledgements

Executive Summary

Table of Contents

I.   Introduction

II.   The Project Activities

III.   Selecting an Interview Sample and Conducting the Interviews

IV.   Feedback on the Issues

V.   One Additional Issue

VI.   Features of a Feasible System

VII.   Feasibility of a Proposed System

VIII.   Summary of Initial Decisions

IX.   The Workshops

X.   Feedback from the Workshops

XI.   Scaling the Item Pool

XII.   Criteria for a State Test

XIII.   Recommendations

Appendix A

Appendix B

Appendix C

Appendix D


I.    INTRODUCTION

In August, 1978, the Massachusetts State Board of Education adopted
a Basic Skills Improvement Policy. This Policy, followed by Regulations
in January, 1979, requires that all public school districts in Massachusetts
develop basic skills improvement programs. As part of their program,
school districts must develop basic skills plans. These secondary level
plans need to incorporate the basic skills objectives for reading, writing,
mathematics, listening, and speaking that are specified in the Regulations.

Districts must evaluate student achievement of these objectives, and
must choose one or more of the following evaluation instruments at the
secondary level to evaluate such achievement:

1.  evaluation  instruments  available  from  the  Department 
of  Education 

2.  commercially  available  evaluation  instruments  approved 
by  the  Department  of  Education 

3.  locally  developed  evaluation  instruments  approved  by 
the  Department  of  Education  as  being  comparable  to 
one  of  the  above. 

To partially fulfill its obligations under the first alternative, the
Bureau of Research and Assessment has developed four equivalent forms of tests
in reading, writing, and mathematics. In developing long range plans, however,
the Bureau questioned whether the current approach to providing a test was the
best way for the Department of Education to discharge its responsibility.
Consequently, the Bureau awarded a contract in October, 1979, to RMC Research
Corporation to determine the feasibility of alternative methods of providing
tests to school systems throughout Massachusetts. This report is a documentation
of the activities RMC undertook to accomplish the task and a listing of RMC's
recommendations for revising the current system.

RMC began the study by producing two documents. The first, provided in
Appendix A, describes 10 possible ways the Department could provide "an
evaluation instrument." By its very nature, the document was redundant and
technically written, and proved to be a poor communication device. Consequently,
RMC produced a second document outlining the issues that
required resolution before a system could be recommended. After meeting with
staff from the Bureau and the Review Committee, RMC visited eight local school
districts and contacted several agencies that could provide information and
feedback on the issues that had been raised.

After those activities were completed, it was clear that several issues
were resolved, while several others needed further investigation and feedback
from local school districts before a final set of recommendations
could be made. RMC wrote an initial report to provide the Bureau, the
Review Committee, and the Advisory Committee on Basic Skills Improvement with
specific  feedback  on  its  findings  to  that  point.   RMC  also  developed  a  workshop 
for  local  school  districts  which  was  presented  at  the  six  Regional  Centers  of 
the  Department  of  Education  in  March.   The  feedback  obtained  from  the  participants 
at  the  workshops  was  incorporated  into  the  recommendations  presented  in  this 
report. 

The  remainder  of  this  report  is  divided  into  twelve  sections.   Section 
II  provides  an  overview  of  project  activities  and  discusses  information 
gathered  from  sources  other  than  local  school  districts.   Sections  III-V 
describe  the  planning  and  operation  of  the  interviews  and  the  feedback 
obtained  from  them.   Section  VI  synthesizes  that  feedback  into  a  set  of 
essential  and  desirable  features  that  any  system   judged  to  be  feasible  for 
the  Basic  Skills  program  should  have,  and  Section  VII  explains  why  none 
of  the  systems  originally  proposed  has  all  those  features. 


Section  VIII  is  a  summary  and  transition  section.   It  is  a  compilation 
of  the  issues  that  were  resolved  prior  to  the  writing  of  the  initial  report 
and  describes  some  of  the  aspects  of  the  recommended  system  that  were  set 
by  that  point.   Sections  IX  and  X  describe  the  planning  and  operation  of  the 
March  workshops  and  the  feedback  obtained  from  them.   Sections  XI  and  XII 
present  discussions  of  two  additional  issues  that  were  raised  and  resolved 
during  the  workshops.   Finally,  Section  XIII  is  a  set  of  recommendations 
which,  taken  together,  comprise  the  system  that  RMC  proposes  Massachusetts 
adopt  to  meet  its  obligations  to  provide  evaluation  instruments  under  the 
Basic  Skills  Improvement  Program. 


II.    THE  PROJECT  ACTIVITIES 

Shortly  after  the  award  of  the  feasibility  contract,  RMC  wrote  a 
planning  document  which  was  to  serve  as  an  outline  of  the  activities  to 
be  conducted  under  the  contract,  as  well  as  to  establish  a  timeline  for 
those  activities.   On  September  26,  1979,  RMC  staff  met  with  members  of 
the  Bureau  staff.   Both  groups  met  with  members  of  the  Review  Committee  on 
October  9,  1979.   During  that  meeting  the  planning  document  was  finalized. 

Since  those  meetings,  RMC  has  been  involved  in  several  activities  to 
gather  the  information  necessary  to  write  this  report.   The  major  activities 
were  the  visits  to  eight  school  districts  and  the  six  workshops  held  in 
March.   A  description  of  those  activities  is  provided  in  later  sections  of 
this  report.   RMC  also  contacted  several  other  sources  for  information, 
opinions  and  ideas;  the  results  of  those  contacts  are  reported  in  this  section, 
followed  by  the  information  RMC  has  collected  to  date  on  the  availability  of 
item  pools  that  might  be  of  use  to  Massachusetts. 

In  addition,  RMC  staff  have  been  involved  in  three  other  activities.   On 
October  2,  1979,  they  attended  a  workshop  conducted  by  the  Bureau  on  the  results 
of  last  year's  assessment  program.   On  November  29,  1979,  they  attended  a 
workshop  in  which  districts  were  trained  in  procedures  for  setting  standards 
under the Basic Skills Improvement Policy. RMC has met with Bureau staff
on six other occasions to coordinate project activities and to provide
instruction on latent trait theory (a method for scaling and equating questions
and tests) to the Bureau staff.

Data  from  Sources  Other  than  Interviews  and  Workshops 

In  addition  to  the  interviews  and  workshops,  RMC  contacted  other  agencies 
and  individuals  whose  input  might  contribute  to  determining,  or  setting  up, 
a  feasible  test  delivery  system  for  Massachusetts.   Some  of  these  contacts 
were  made  at  the  suggestion  of  the  Bureau,  others  were  made  on  the  initiative  of 
RMC  staff.   Although  not  all  attempts  to  contact  people  were  successful,  they 
are  listed  below  as  a  means  of  documenting  the  efforts  made  during  the  first 
three  months  of  the  contract. 

Virginia Department of Education. Paul Williams, Test Development Supervisor
for the Virginia Department of Education, described the three-part
Virginia state assessment program to us. The three phases are: (1)
norm-referenced statewide assessment, (2) the Virginia Basic Learning Skills
Program, and (3) graduation competency testing. Paul felt that the Basic
Learning Skills Program (reading, math, and communication skills) would be of
most interest to Massachusetts.

Virginia  contracts  with  National  Evaluation  Systems  of  Amherst,  MA,  to 
write  and  field  test  items  for  this  program.   NES  then  forwards  the  field  test 
data  to  the  Virginia  Department  of  Education  for  analysis.   Items  are  all  Rasch 
scaled  and  the  final  form  of  the  test  is  constructed  by  the  Department.   The 
test is then used to identify students for remediation, using Rasch person
misfit statistics. Graduation competency testing is accomplished in much the same
way  except  that  the  contractors  for  item  development  are  Instructional 
Objectives  Exchange  and  Scholastic  Testing  Service. 


Wisconsin Department of Education. At the Wisconsin Department of
Education, RMC talked with Vicky Frederick about the item bank which they
maintain for the use of local districts. Since Wisconsin does not have a
statewide competency testing requirement, the Department simply assists
local districts which elect to establish their own minimum competencies.
Assistance is in the form of matching a district's minimum competency
objectives against the Northwest Evaluation Association item bank, and providing
appropriate items to the district from the bank. The items are all filed
on cards, and Wisconsin has no immediate plans to computerize the system.
Vicky suggested that RMC contact the Northwest Evaluation Association for
specific information about the item bank.

Northwest Evaluation Association. According to Susan Holmes, NWEA
has three item banks: mathematics, reading, and language arts. There are
approximately 1200 items (primarily grades 3-8) in each bank, and Rasch
difficulty values are reported for each item. NWEA is a nonprofit
organization. The local districts who developed the items are merely interested
in recovering their costs. NWEA is especially interested in negotiating
with states or local districts who will help them field test additional items.
The NWEA Board approved a motion at its February meeting that "license to
reproduce and use items for State Agency purposes be granted the State of
Massachusetts for $10,000 plus a one time assessment of 10¢ per student."
This is a preliminary offer which can be further negotiated.
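
To show how the terms of the preliminary offer translate into a total, the sketch below applies the quoted figures ($10,000 plus 10¢ per student) to a student count. The count of 50,000 is a hypothetical figure chosen only for illustration, and which students the per-student assessment would cover is not specified in the motion.

```python
LICENSE_FEE_DOLLARS = 10_000   # flat fee quoted in the NWEA motion
PER_STUDENT_CENTS = 10         # one-time assessment quoted in the motion


def nwea_license_cost(students: int) -> float:
    """Total cost in dollars under the preliminary NWEA offer."""
    if students < 0:
        raise ValueError("student count cannot be negative")
    return LICENSE_FEE_DOLLARS + students * PER_STUDENT_CENTS / 100


# Hypothetical number of students covered by the assessment.
example_cost = nwea_license_cost(50_000)
```

Under these assumptions, each additional 10,000 students adds $1,000 to the flat fee.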

Connecticut  Department  of  Education.   Joan  Baron  indicated  that  several 
communities in Connecticut have developed their own local assessment
instruments. To her knowledge, however, none of these was in the form of an item
bank.   She  also  indicated  that  these  districts  felt  very  protective  about 
what  they  had  developed  and  were  unlikely  to  share  their  materials  with 
Massachusetts  or  other  districts.   The  districts  identified  by  Joan  as  possibly 
having  materials  were:   Greenwich,  West  Hartford,  East  Hartford,  Manchester, 
South  Windsor,  and  Stamford.   At  the  present  time  RMC  does  not  view  these 
communities  as  a  likely  source  of  items.   Should  MDE  desire  further  information 
about  these  sources,  however,  the  above  communities  will  be  contacted. 

SCORE  (Riverside  Publishing  Co.).   Early  in  this  contract,  RMC  contacted 
Riverside  Publishing  Company  (a  division  of  Houghton  Mifflin)  to  discuss  the 
SCORE  item  bank.   A  meeting  was  held  February  19,  1980,  in  Portsmouth,  NH,  to 
discuss  how  SCORE  might  prove  useful  to  Massachusetts,  and  other  issues  related 
to  setting  up  item  banks.   As  a  consequence  of  that  meeting,  another  meeting 
was  held  with  representatives  from  Riverside,  the  Bureau  and  RMC  on  March  6.  At 
that  meeting,  details  were  discussed  concerning  Massachusetts'  interest  in 
obtaining  rights  to  use  SCORE.   The  Bureau  is  awaiting  a  response  from 
Riverside. 

Rhode Island. RMC visited the University of Rhode Island to find out
about the computerized testing system operated there. As part of the Local
Planning and Assessment Process (LPAP), the Rhode Island Department of
Education has provided an objective and item bank to local school districts.
Districts participating in LPAP are expected to put together tests using the
item bank as a resource. To facilitate this process, the Test Service
Center has been established. Through it, local districts can request customized
tests and scoring services. RMC obtained sets of illustrative materials during
the visit and showed them to districts in Massachusetts as part of the
interview process, presenting the Rhode Island system as one model of how a
computerized system might operate.

Availability  of  Item  Pools 

As  part  of  the  initial  phase  of  the  feasibility  study,  RMC  contacted 
many  owners  of  item  pools  to  determine  their  relationship  to  the  Basic  Skills 
objectives,  whether  the  pools  were  available  for  use  in  Massachusetts,  and 
if  so,  what  the  costs  would  be.   The  starting  point  for  this  search  was 
Michael  Hiscox  of  the  Northwest  Regional  Educational  Laboratory,  author  of 
a  paper  entitled,  "Item  Banks — Where  Are  They?"   Beginning  with  this  lead,  we 
uncovered a few banks that might be worth pursuing.

One possible source of items would be the Sample Assessment Exercise
Manual (SAEM), published by the California State Department of Education.
Twenty-nine of the 38 Basic Skills objectives in mathematics are directly
matched by objectives in SAEM. The correspondence of Basic Skills objectives
and SAEM objectives is less direct in reading, but it is likely that many of
the items in SAEM measure the Basic Skills objectives. SAEM materials are not
copyrighted, and according to Dr. William Padia, who directed the project, are
available for free.

Another potential source of items would be the Fountain Valley Teacher
Support System. That pool has items that match all the Basic Skills
objectives in reading, and 12 of the objectives in mathematics (including 4 not
covered in SAEM). An advantage of the Fountain Valley materials is that an
entire teacher support system could be purchased along with the items, thus
facilitating remediation. The Fountain Valley materials are managed by
Richard L. Zweig Associates of Huntington Beach, California. They are willing
to entertain negotiations for the purchase of items.
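
The mathematics coverage figures reported for SAEM and Fountain Valley can be combined with simple arithmetic. The sketch below works only from the counts stated above (38 objectives in all, 29 matched by SAEM, 12 matched by Fountain Valley, 4 of those not covered in SAEM); the function and variable names are assumptions made for this example.

```python
TOTAL_MATH_OBJECTIVES = 38
SAEM_MATCHED = 29
FOUNTAIN_VALLEY_MATCHED = 12
FOUNTAIN_VALLEY_NOT_IN_SAEM = 4


def combined_math_coverage() -> dict:
    """Combine the two pools' mathematics coverage from the reported counts."""
    # Objectives matched by both pools.
    overlap = FOUNTAIN_VALLEY_MATCHED - FOUNTAIN_VALLEY_NOT_IN_SAEM
    # Objectives matched by at least one pool.
    combined = SAEM_MATCHED + FOUNTAIN_VALLEY_NOT_IN_SAEM
    # Objectives matched by neither pool.
    uncovered = TOTAL_MATH_OBJECTIVES - combined
    return {"overlap": overlap, "combined": combined, "uncovered": uncovered}


coverage = combined_math_coverage()
```

On these counts the two pools together would match 33 of the 38 mathematics objectives, leaving 5 for which items would have to come from another source.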

The  Northwest  Evaluation  Association  of  Tacoma,  Washington,  has  developed 
a  bank  which  they  have  sold  to  at  least  one  public  agency.   How  well  their 
items  match  the  Basic  Skills  objectives  is  unknown  at  this  time. 

Los  Angeles  County  Schools   have  a  large  bank.   Dr.  John  Martois  visited 
Massachusetts  on  December  10,  1979,  to  discuss  the  bank,  but,  at  this  point,  Los 
Angeles  is  not  considering  selling  or  leasing  its  pool  for  use  in  Massachusetts. 
It  is  possible,  however,  that  they  could  be  persuaded  to  reconsider. 

There are three other potential sources of pools about which less is known.
The Belmont Standards Test System is still being field tested by Belmont,
California. The director of that project recently has been hospitalized and
progress has temporarily halted. However, it is possible that work will be
finished and the pool will be available for purchase. The Alaska Objectives and
Item Bank is purported to be a bank of high quality, containing almost 2,000
items in reading and 1,600 in math. However, RMC has not received a response
to its inquiry and will have to follow up on its initial contact. Another
promising lead is the bank at the Institute for Educational Research in
Glen Ellyn, Illinois. RMC learned of its existence shortly before the production
of this report and contact with that group has not yet been made.


Contact with commercial publishers has been discouraging. While
many are willing to custom-design tests for school districts, only Riverside
Publishing Company expressed a willingness to sell its pool to Massachusetts.
However, it remains to be seen what would happen if Massachusetts decided to
issue a Request for Proposals. It seems probable that some publishers would
respond to an invitation to sell a pool of items for limited use within
Massachusetts for a reasonable cost.


III.    SELECTING  AN  INTERVIEW  SAMPLE  AND  CONDUCTING  THE  INTERVIEWS 

From the very beginning of this project, RMC has maintained that
alternative approaches to basic skills testing must be evaluated in light of local
district opinion. We therefore proposed that a random sample of eight
districts be interviewed early in the feasibility study to help narrow the
list of alternatives being considered. These interviews were not intended to
yield definitive statistics on the popularity of various alternatives, but
were viewed as a means of obtaining local district concerns. It was expected
that the comments of local district personnel would provide us with a sense
of which factors would be important in selecting among alternatives, rather than
dictate which system should ultimately be adopted.

Selecting  a  Sample 

To solicit a reasonably broad range of opinion, RMC chose two districts
from each of four different kinds of community: big city, industrial suburb,
residential suburb, and other. Within each of these categories we selected one
district which participated in the 1978-79 Massachusetts assessment of Basic
Skills under the "Local Option," and one district which did not participate.
In that way, we hoped to obtain the views of those with no experience in using a
"state test," as well as opinions influenced by the previous year's testing
experience.

Within each category districts were selected by means of a random number
table from lists provided by the Department. Districts in the "Local Option
Participant" category were limited to those which had agreed to have their
names released. This randomly selected sample was then presented to the
Review Committee for comment at its meeting on October 11, 1979. After some
discussion the committee recommended that the district selected for the
"Participant-Other" category be replaced with a regional vocational school
qualified for that category, and it was added to the sample. The Review
Committee also recommended that the district selected for the
"Non-participant-Industrial Suburb" category be replaced with an industrial
suburb in western Massachusetts. Two such districts were suggested, and
selection was made by the flip of a coin.
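
The initial random draw described above (one district from each combination of community type and participation status) can be sketched as follows. The district lists are hypothetical placeholders, and a seeded pseudo-random generator stands in for the random number table RMC actually used; the Review Committee's two substitutions are not modeled.

```python
import random
from typing import Dict, List, Tuple

COMMUNITY_TYPES = ["big city", "industrial suburb", "residential suburb", "other"]
PARTICIPATION = ["participant", "non-participant"]

Cell = Tuple[str, str]


def draw_sample(frame: Dict[Cell, List[str]], rng: random.Random) -> Dict[Cell, str]:
    """Pick one district at random from each (community type, status) cell."""
    sample = {}
    for kind in COMMUNITY_TYPES:
        for status in PARTICIPATION:
            sample[(kind, status)] = rng.choice(frame[(kind, status)])
    return sample


# Hypothetical sampling frame: five placeholder districts per cell.
frame = {
    (kind, status): [f"{kind} / {status} / district {n}" for n in range(1, 6)]
    for kind in COMMUNITY_TYPES
    for status in PARTICIPATION
}

sample = draw_sample(frame, random.Random(1979))
```

Four community types crossed with two participation statuses yield the eight districts in the interview sample.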

The  Districts  Selected 

The names of the eight school districts that RMC visited will not be revealed
in this report. However, readers who know something about the districts, and
about the staff from those districts who attended our meetings, will have a
better context in which to understand the comments and reactions of the people
visited. For this reason, capsule summaries of the eight districts are provided:

District  A  was  a  regional  school  district  located  near  the  1-495  loop. 
Attending  the  meeting  were  the  regional  superintendent,  superintendents  of 
the  feeding  elementary  districts,  and  middle-level  administrators  from  the  high 
school.   To  District  A,  an  overriding  consideration  was  concern  that  the  results 
would  be  used  to  unfairly  compare  the  feeder  districts  to  each  other,  and  the 
regional  as  a  whole  to  neighboring  districts.   At  one  point  in  the  meeting,  the 
opinion  was  expressed  that  they  would  have  preferred  that  the  Department  of 
Education had never put out a test at all. But, since they could not afford to
develop their own test, they were going to use whatever the Department
provided.      They  were  concerned   that   the  Basic   Skills  program  might    evolve 
into  a  program  that  required  all  districts   to  use  one  test,    that   standards 
would   be   set   by   the   state,    and   that   passing   the  test  would   become  a  gradua- 
tion requirement. 

District  B  was  a  medium-sized   school  district   located  just   outside  a 
major  urban  area.      Attending   the  meeting   were  the  district    superintendent  and 
four   principals.      The  attitude  toward   the  current   program  was  positive,    and 
all  present  were  receptive  to  other  approaches.      They  were  content   to   follow 
the  leadership  of    the  state;    they  felt   that   the  public   had   a  right   to   know 
what  was  going   on   in  the  schools.      They  were  quite    sophisticated    in  their 
knowledge  and  use  of  modern  technology.      For   example,    they  had   a  computer 
terminal   in  their   school  which  was  used   regularly  by  students,    and   the  princi- 
pal was  proud   to   provide  a  demonstration. 

District   C  was  a  medium-sized  district   serving   a  fairly  wealthy  bedroom 
community.      The  meeting   was  attended   by  a   large  group   comprised  mostly   of 
teachers  and   principals.      District   C  currently  has  an  extensive  testing 
program  in  place,    and  needs   little,    if   any,    assistance  from   the   state.      Their 
reactions  to   the  issues  were  driven  by  a  desire  to  have  the  Department   in- 
stitute a  tightly  run  program  that  would  maximize  the  comparability  of    test 
results  from  district   to  district.      They  believed  very   strongly   that   the  pro- 
gram would   become  "a  farce"   if   comparisons  could  not  be  made.      They  were  also 
confident   that   if    one  statewide  process  and   standard  were  adopted,   none  of 
their   students    (except   for  perhaps  a  few  in  special   education  classes)    would 
fail. 

District  D  was  a  large  school  district    serving   a  blue-collar   community 
in  the  eastern  part  of   the  state.      A  variety  of    staff   attended   the  meeting, 
but  all  were  involved    in  their  Basic   Skills  Assessment  program  in  some  way. 
District  D   is  developing    its   own  test  for  Basic   Skills  and    is  well  committed 
to  that  approach.      As  a  consequence,    they  had   less  concern  than  did   other 
districts  as   to  what   the  Department   should  do. 

District   E  was  a  medium-sized  district   serving   a  blue-collar  community 
in  the  western  part  of   the  state.      The  meeting  was  attended   by  three  district 
administrators.      They  wanted   a   system   that  would   have  low  cost  and   be  easy 
to  administer.      They  were  concerned   that   the  Department  might  move  to  a 
sophisticated    system,    and  were  afraid   that   they  might  be  growing    too  dependent 
on  a  few  experts  for  advice.      They  felt   the  State  was  moving   too  fast;    that 
while  they  were  ahead  of   schedule  right  now,    they  would   be  hard  pressed   to 
keep  up  with  further   change. 

District  F  was  a   small  regional   school  district    in  the  western  part  of 
the  state.      Attending    the  meeting   were  the   superintendent,    high  school  princi- 
pal,   elementary  principal  and   the  chair  of   the  assessment   committee.      These 
representatives   of  District  F  had   the  most   sophisticated  view  of    testing  RMC 
encountered    in  the  eight  districts.      They  pointed   out   several  consequences  of 
the issues that RMC had not considered previously. In spite of their
sophistication, it was likely that they would use whatever the Department offered,
since they did not have the staff or financial resources to go much beyond that.


District   G  was  a  vocational   regional   system    in   the  western  part   of 
the   state.      The  meeting  was  attended   by   the  assistant    superintendent   plus 
six   teacher-leaders.      This  district,   which  on   the  surface  appeared   to  be 
quite  different   from  the  others   RMC  visited,    had   many   similarities   to 
District   C.      As    in  District   C,    the  opinions   of    the   staff    in  District   G 
were  driven  by   the  concern   that   the  test   provided    this  past   year   by   the 
Department   was   too    easy,    a    statewide   standard   was  needed    to    "put    some 
teeth"   in  the  program,    and    testing   time,    for   this  purpose,   needed   to  be 
held   to   a  minimum.      It  may  be  coincidental   only,    but    it    is  worth  noting 
that  Districts  C   and  G,    the  only  two   to   invite  a   large  number   of   teachers 
to   the  meeting,    were  the  two  districts   that   expressed   the  strongest  desire 
to   establish  a   statewide  test  with  statewide  standards. 

District  H  was  a  large  urban   school  district.      RMC  visited  with  the 
staff    in  two   schools   in  the  district.      In  both  cases,    the  meeting  was 
attended   by  the  headmaster  and   two  assistant   headmasters.      One  school  also 
invited   two  additional  teacher-leaders.      The  district   has  a  criterion- 
referenced   testing    system   that   accommodates  most   of    the  test   information 
needs   of   people   in  the  schools;    consequently,    the  statewide  program  was 
viewed  as  an  add-on   system.      In  both  schools   there  seemed   to   be  an   impatience 
to   "get  going."      Staff   wanted   the   state  to   establish  a  program   that   could  be 
used   to  focus   efforts   on   improving   basic   skills. 

Conducting    the  Interviews 

Each  district   in  the   sample  was  first  contacted  by  the  Department   regard- 
ing   its  willingness  to  participate.      Thereafter,    RMC  contacted   each  district 
by  phone  to  make  arrangements  for   the   interviews.      It  was  recommended   that 
each  district   invite  the  members   of    its  basic   skills   committee  to  attend,    and 
that  regional  districts   invite  feeder   committees   to  participate.      Prior  to 
the  interview,    each  district  was  mailed   five  copies  of   a  document    entitled, 
"Issues  Related   to   the  Dissemination  of   Tests   by  the  Massachusetts  Department 
of   Education  for   the  Basic   Skills  Improvement  Policy."     This  paper  discussed 
seven   issues  related   to   how  the  Department   could   best  provide  test  materials 
to   local  districts.      While  this  paper  was   intended   to   serve  as   the  basis  for 
discussion  during   the   interviews,    local  districts  were  also   encouraged   to 
raise  issues  not  discussed    in  the  paper.     A  revision  of    this  paper  appears   in 
Appendix  B. 

All    interviews  were  conducted   between  November   6  and  December  3,    1979. 
Most   lasted   between  two  and   three  hours.      Interviews  were  attended   by  Drs. 
Hill  and  Lyczak  from  RMC  and   by  at   least  one  staff  member  from  the  Bureau  of 
Research  and  Assessment.      The  format  of   the   interviews  was  discussed  and 
approved   by  the  Bureau    in  a  meeting  with  RMC  on  November   1,    1979.      Each  inter- 
view began  with  an   introduction  to   the  project   by  a  Bureau   staff  member.      Drs. 
Hill  and  Lyczak  then  led  a  general  discussion  of    each  issue  discussed    in   the 
paper.      When  all  of    the   issues  had   been  discussed,    an   informal   poll  was   taken 
in  order   to   summarize  the  opinions  of    the  interviewees.      While  this  poll   took 
the  form   of   a   separate  vote  on  each  issue,    its  purpose  was  to  assist  RMC   in 
characterizing   concerns  of    the  district  rather   than  to  determine  the  course  of 
basic   skills  assessment  by   tallying   responses.      All  responses,    of   course,    remain 
confidential.      However,    some  districts   requested,    and   were  given,    a   brief 
synopsis of how other unnamed districts in the sample had responded to the issue.


IV.      FEEDBACK  ON  THE   ISSUES 

During   the  months   of   November   and  December,  1979,    RMC   staff  visited  eight 
school  districts.      At   each  meeting,    the   issues  discussed    in  Appendix   B 
were  reviewed  with  representatives   of    the  districts.      The  purpose  of    this 
chapter   is   to  report   on  the  feedback  received  on  each  of    the   issues. 

There  are  seven  basic    issues  that  were  discussed.      In  this   section  of 
the  report,    each   is  presented    in   turn  as  a   question.      The  first   paragraph 
of the discussion briefly describes the issue. All subsequent paragraphs
relate the comments that were made on the issues. One additional issue
was  raised   at   two  of   the  meetings,    and   that    issue  is  discussed    in  the  next 
section   of   this  report. 

Issue #1 (A): Should the Department release a short test or a long test?

Under   the  Basic   Skills  Improvement  Policy  districts  are  obliged   to    (1) 
set   standards  and  report   how  many   students  met   those   standards,    (2)    diagnose 
individual  needs,    and    (3)    assess   the  local  curriculum.      A  short    (50-60   item) 
test  would   adequately   serve  the  first   of    these  three  functions,    but  would 
provide  little  diagnostic   information  and   only  fragmentary  evidence  about 
the  effectiveness   of   a  district's  curriculum.      To  accomplish  these  latter 
tasks  as  well,    a   longer    (200-300   item)    test  would   be  required.      The  question 
is  whether  or  not   the  state   should   provide  districts  with  this   longer,   more 
versatile  test,    so   that  all   three  requirements  could   be  satisfied  using   a 
single  instrument. 

The  most  frequently  heard  comment   on  this   issue  was   that  districts  al- 
ready had  diagnostic   and  curriculum  evaluation  procedures   in  place,    so  a 
longer   test  would  be  redundant.      Furthermore,    a  district's  curriculum  con- 
tains  instruction   in  areas  other   than  the  basic   skills.      A  long   test   issued 
by  the  state  would    focus   only  on  basic  skills   and   could  not   possibly 
deal  with  the  uniqueness  of    each  district's  curriculum. 

A  substantial  majority  of   districts   in  the  sample  also  felt   that  a  long 
test  would   tax   the  endurance  of  minimally  competent   students,    thereby  foster- 
ing guessing   and   resulting    in  less  reliable  measurement   than  a   short   test. 
Also  mentioned   was   the  fact   that  diagnostic    information   is  needed   only  for 
students  who  do  not  pass   the  competency   standard.      In  most   cases   this  would 
be  a  very  small  number — too   small   to  justify  the  development   of  a   long   test. 

Many   interviewees  felt   that   the  state  should  not  be  competing  with 
commercial    test  publishers,    especially   in  the  area  of  diagnostic   instruments. 
They  expressed  reservations  about   the  expense   of  developing    such  a   test,    and 
doubted that a state test could rival the quality of existing commercial tests.
It  was  felt   by  many   that   in   trying   to   satisfy  the  needs  of   all  districts,    a 
long   test  produced   by  the  state  would   have  to  be  "watered  down."     Its  content 
would   be  so  general   that   it  would   not   satisfy  a  district's   information  needs. 

A  significant  minority   of  districts   in  the  sample  perceived   a  long    test 
as  an  attempt  by  the  state  to   establish  a   statewide  curriculum.      If   a   long 
test  were  created   to   evaluate  curriculum,    then  the  content   of    that   test  would 
end  up  defining    the  curriculum.      Furthermore,  most  districts   see  minimum 
competency requirements as the heart of the Basic Skills Policy. "That in
itself is a monumental task" without the state imposing diagnostic and
curriculum    evaluation   requirements    at    the   same    time. 

One   district,    which  uses   diagnostic    testing    throughout    the    elementary 
and    secondary   years,    said    that   a    state  diagnostic    test    would   disrupt    their 
current    system.      If    there  were  pressure   applied    to   use   the    state  diagnostic 
test   at   grade  nine,    longitudinal   comparisons   could   not    be  made  with  diagnos- 
tic   testing   done    in   previous   years. 

Several   people  were  concerned    about   a    single  diagnostic    instrument   being 
used    in  districts   with  different  minimum   standards.      They   felt    that    the 
nature   of    the  diagnostic    instrument    should   be  determined   by   the   level   of    per- 
formance demanded    in   the  district's  minimum    standards. 

Among    those  who   supported    the   idea   of    a   longer    test,    comments   centered 
on   the   efficiency  of   using    a   single   test   rather    than   three   separate   tests 
to   fulfill   Basic    Skills    obligations.      One  regional   district   felt    that   a   long 
test  might   force  feeder   districts   to    standardize   their   curricula.      Another 
district  favored    a    long    test   because   it   felt   an   obligation   to  gather   all   the 
information   it   could   get    on   student   performance.      On   the  whole,    however,    we 
found  little   support   for  a   long    test.      Most  districts  felt   that   a   short   test 
would   adequately   satisfy   their   needs,    while  at    the    same   time  minimizing   costs 
to    themselves   and    to  the   state. 

Issue #1 (B): Should the Department release both a short test and a long test?

The  assumption  underlying  this  issue  is  that  different  districts  have 
different  needs.   While  some  may  already  have  diagnostic  and  curriculum 
evaluation  procedures  in  place,  others  may  be  looking  to  the  state  for 
materials  to  satisfy  these  latter  two  requirements  of  the  Basic  Skills  Improve- 
ment Policy.   The  question  is  whether  the  state  should  provide  both  a  short 
test  for  the  purpose  of  establishing  minimum  competencies,  and  additional  sets 
of  items  (long  test)  for  the  purpose  of  diagnosis  and  curriculum  evaluation  in 
districts  which  elect  to  use  them. 

Most  of  the  people  we  interviewed  had  no  objection  to  the  state  supplying 
a  long  test  to  districts  which  chose  to  use  it.   Some,  however,  expressed  the 
reservation  that  political  pressure  within  the  district  might  force  them  to 
use  the  long  test  when  they  preferred  the  short. 

One  repeated  concern  was  that  the  system  not  be  too  complicated.   Districts 
did  not  want  to  be  bothered  extracting  short  test  items  from  the  long  test  and 
were  worried  that  administration  and  scoring  in  such  a  system  would  be  unduly 
complex.   There  was  also  concern  expressed  by  some  that  by  providing  districts 
so  many  options  the  state  was  making  statewide  comparisons  difficult.   This 
group  favored  a  single  statewide  test,  be  it  long  or  short,  as  well  as  state- 
wide minimal  standards.   On  the  positive  side,  one  district  saw  the  development 
of  an  optional  long  test  as  a  way  of  reducing  "windfall  profits"  among 
commercial publishers.

While  the  majority  of  districts  favored  a  short  test,  there  was  little 
opposition to having the state provide a longer version for those who wanted it.
Those opposing such a solution were mainly concerned about the state
trying    "to   be  all   things   to  all  people." 

Issue #2: Should the Department release multiple forms of each test?

Obviously  the  more  forms   of   a   test   the  state  provides,    the  more  expen- 
sive  test  development   will  be.      The  basic    issue  here   is  whether   local  dis- 
tricts feel   strongly   enough  about   having  multiple  forms   to  warrant   the  extra 
expense  of   producing   them.      Some  of    the  factors  which  districts  were  asked 
to  consider    in  determining    their  needs  were  test   security,    the  desire  for 
choice,  multiple  administrations,    and  comparability  among  districts. 

Among  districts  which  favored    the  release  of  multiple  forms,    the  most 
frequently   stated  need  was  for   subsequent   retesting   of    students  who   had 
failed.      Those  favoring   a   single  form,    however,    argued   that  retesting   could 
be done the following year with that year's single form.

The  other   compelling   argument  for   issuing  multiple  forms  was   that   stu- 
dents from  adjacent  districts,   who  were  tested   on  different  dates,    could 
exchange   information  about   the  content   of    the  test.      This  point  prompted 
the  comment    in   several  districts   that   the   state  should   establish  a   statewide 
testing  date.      If    everyone  were  tested   on  a   single  date,    these  people  argued, 
a   single  form   of    the  test  would   suffice.      As  one  district   astutely  observed, 
however,    there  would  always  be  absentees   to  deal  with.      The  only  question   is 
whether  or  not    it    is  worth  developing   a   separate  form  for   the  testing   of 
absentees,    assuming   that  all  absentees  could   take  the  makeup   test  at   the 
same  time.      Other  districts  felt   that   the  problem  of    students  passing   on 
questions   to   their  classmates  would  not  be  a  problem  because  test   questions 
were  not  referenced   to   text  material  which  could   be   studied;    and   also  because 
the  consequences  of   failing   the  test  were  not   severe  enough  to  motivate  them 
to make extraordinary efforts in an attempt to pass it. One district also
cited   the  use  of  multiple  forms  as  a  way  of   preventing    teachers  from  teaching 
to   the  test. 

There  was  no  demand   for  more  than  two   equivalent   forms  of   the   test   each 
year.      The  only  perceived  need  for  more  than  two   forms  would   be  to  accommodate 
the  problems   of  makeups  and  administrations  on  different   dates  mentioned    in 
the  paragraph  above.      Since  this  problem  conceivably  could  require  that  a  great 
number   of   forms  be  made  available  before   it  would   be  solved,    the  solution  will 
have  to   come  from  a   source  other   than  multiple  forms.      Consequently,    the  Depart- 
ment  should   plan  on  producing  no  more  than  two  forms   each  year. 

In  discussing   the  choice  which  multiple  forms  would   offer,    one  district 
questioned   how  much  freedom  they  would   have  to   interchange   items  across   equiva- 
lent forms.      Permitting  districts   to   exchange  less  desirable  items  on  one  form 
for more desirable items on another, they reasoned, might be a way of avoiding
confrontations  over   specific    items. 

The   issue  of    comparability  was  raised   by  both  the  advocates  and   opponents 
of  multiple  forms.      Those  favoring  multiple  forms   tended   to  be  those  who 
supported local autonomy and wished to avoid statewide comparisons. Those
favoring a single form tended to be those who were seeking comparisons
and  desired  a  statewide  minimum  standard.   While  the  use  of  different 
forms  does  not  rule  out  comparisons  across  districts,  it  was  clear  to 
both  groups  that  releasing  a  single  form  would  facilitate  comparisons. 

Issue #3 (A): Should the Department release its test as an item pool?

As  an  alternative  to  releasing  a  state  test,  it  is  possible  that  the 
Department  could  release  a  large  bank  of  items  and  allow  districts  to  con- 
struct their  own  test.   Each  item  in  the  pool  would  be  related  to  an  objec- 
tive in  the  Basic  Skills  Improvement  Policy  and  would  be  accompanied  by 
statistics  indicating  its  difficulty.   Such  a  system  would  bypass  the  issue 
of  security  because  the  item  bank  would  be  too  large  for  students  to  memorize. 
Since  different  combinations  of  items  would  be  selected  for  use  each  year,  a 
single  item  bank  could  provide  appropriate  tests  for  several  years.   In 
addition,  items  could  be  added  to  the  bank  to  extend  its  life.  Most  importantly, 
however,  an  item  banking  system  permits  districts  to  customize  a  test  to  their 
own  curriculum,  their  own  minimum  standards,  and  their  own  informational  needs. 
Under  such  a  system  it  is  likely  that  each  district  would  develop  a  unique 
assessment  instrument. 

Most  districts  viewed  an  item  pool  system  as  too  sophisticated,  complex, 
and  expensive.   Among  their  major  concerns  were  the  lack  of  local  expertise 
in  test  construction,  and  the  lack  of  financial  resources  to  support  a  test 
development  effort.  With  limited  resources  and  personnel  available,  districts 
were  also  concerned  that  the  product  they  produced  might  be  deficient  in  some 
way and thereby expose them to criticism and/or a legal liability.

Among  the  technical  considerations  which  concerned  interviewees  was  the 
problem  of  equating  forms  of  the  test  which  they  constructed.   Of  foremost 
concern  were  problems  associated  with  having  the  difficulty  of  the  test  vary 
from  one  year  to  the  next.   Districts  wanted  to  know  whether  the  test  they 
constructed  from  the  pool  would  have  to  be  approved  by  the  state.  Many  were 
worried  that  some  districts  would  choose  all  easy  items  and  therefore  look 
artificially better than districts who constructed a harder test. To overcome
this  problem  it  was  suggested  that  the  state  could  establish  some  guidelines 
regarding  the  minimum  number,  and  average  difficulty,  of  items  to  be  chosen 
for  each  objective.   This  approach  would  also  allay  the  fears  of  those  who 
felt  local  test/evaluation  experts  would  have  too  much  power  in  determining 
the  nature  of  the  test. 
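
The kind of guideline suggested above can be illustrated with a short sketch. It is only an illustration under stated assumptions: the report gives no numerical thresholds, so the minimum item count, the ceiling on average difficulty, the use of proportion-correct as the difficulty statistic, and the function name check_test are all hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical guideline values; the report does not specify any numbers.
MIN_ITEMS_PER_OBJECTIVE = 3
MAX_MEAN_P_VALUE = 0.85  # a mean proportion-correct above this counts as "too easy"


def check_test(items, objectives):
    """Return a list of guideline violations for a district-built test.

    items: list of (objective, p_value) pairs drawn from the item pool,
           where p_value is the proportion of students answering correctly.
    objectives: every objective the test is expected to cover.
    """
    by_objective = defaultdict(list)
    for objective, p_value in items:
        by_objective[objective].append(p_value)

    problems = []
    for objective in objectives:
        p_values = by_objective.get(objective, [])
        if len(p_values) < MIN_ITEMS_PER_OBJECTIVE:
            problems.append(f"{objective}: only {len(p_values)} item(s)")
        elif mean(p_values) > MAX_MEAN_P_VALUE:
            problems.append(f"{objective}: items too easy on average")
    return problems
```

A district that chose only easy items for one objective, or too few items for another, would be flagged by such a check before the test was used, which speaks to the concern about districts looking artificially better.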

One  very  strong  sentiment  expressed  by  several  districts  was  that  providing 
districts  with  the  flexibility  of  an  item  pool  made  little  sense  when  objectives 
were  already  set  by  the  state.   The  only  choice  left  to  the  districts  was  what 
items  would  be  used  to  measure  the  objectives.   They  felt  that  since  the  state 
chose  the  objectives,  the  state  might  as  well  choose  the  items  too.   Furthermore, 
they  argued,  basic  skills,  by  definition,  are  supposed  to  be  "universal."  If 
basic  skills  are  a  common  denominator,  a  district  cannot  maintain  that  it  is 
unique  and  requires  a  customized  test.   If  a  school  district  does  want  a 
customized  test,  it  could  always  write  its  own  rather  than  use  the  state  supplied 
materials.   Several  people  wondered  whether  the  state  would  keep  a  record  of 
which  items  districts  used  from  the  item  bank. 


Regional   districts   were   fearful    that    the    item   bank   concept   would    en- 
courage   too   much  diversity   among    their    feeder   districts.      They   felt    that 
more  uniform    tests   and    standards   would   guarantee   them    a   more   uniform    level 
of    competency   among    entering    students. 

Among    the  many  positive  comments  made  were   the    importance  of    a  district 
feeling    ownership   for    its   test,    and    the  guidance  which  an    item   bank  would 
provide   a  district    in  reviewing    its   basic    skills   curriculum.      One  district 
was    interested    in   the   possibility   of  using    items   from   the  bank  for   classroom 
testing.

Issue #3 (B): Should the Department release both an item pool and an annual
test constructed from the item pool?

The  majority   of   districts   fell    into   two   categories:      those  which  desired 
autonomy  and    the  flexibility   of   creating    their   own   test   and    those  which  pre- 
ferred   the  convenience   of   using   a    state   supplied    instrument.      In   order   to 
accommodate  both  camps,    a   hybrid  version   of    the   item   bank  concept   was    suggested 
which   includes   releasing    items  and   also   providing    a    suggested    test.      Since  the 
test  would    be  constructed   from   the   item  pool,    districts   would   be  free   to   re- 
place  items   they  did   not    like   or    simply  adopt    the   test    "in   toto." 

The  response   to    this    system  was  almost    entirely   favorable.      Districts 
which  were   interested    in  developing   a  customized    test   felt  more  comfortable 
working    from    a  "model",    and  districts  which  wanted    the   state   to   issue  a   test 
saw  the  annual    sample   test   as  fulfilling    their   needs.      Only   a    small  number   of 
respondents    opposed    such  a    system   on   the  grounds   that    it   allowed   districts   too 
much  freedom.      Freedom   to   "tamper"  with  the   state   test   by   replacing    items  was 
seen  as   inhibiting    the  development    of    statewide  minimal    standards,    even   though 
the   state  currently  has  no  plan   to    impose   such   standards.      One  district    suggested 
that   the   state  replace  unacceptable   items  for   a  district   rather   than  release  the 
whole   item  pool  and   allow  the  district    to  do    it. 

There was also some concern among districts which favored such a system
that security would be a problem for districts employing the suggested state
test. Once a district had used the suggested test, students in neighboring
districts  could   obtain   information  about    its   contents.      There  was  also   concern 
that    if    the  entire   item  pool   ever  got   out,    private  entrepreneurs  would    set  up 
schools  or  develop  materials   to  coach   students   for   the  test.      For   that   reason, 
one  district    suggested   that   the  state  release  only  a  portion  of    the   item  pool 
each  year.      This   strategy  would   allow  districts   some  flexibility   in  choosing 
items,    while  maintaining    the   security  of    the   item  bank  as  a  whole. 

Issue  #4:      Should   the  Department  provide  a   computerized   testing    service? 

An  alternative   to    the  release  of    an   item   pool  would   be  a   computerized 
testing    service.      Under    such  a   system,    school  districts  would   receive  a  booklet 
containing    the  objectives   available   to   be   tested    (or   alternatively,    the   items 
available   in   the  bank) ,    and   complete  an  order   form  requesting   a   test   designed 
to their specifications. A central agency would receive the order, produce
the   test   requested,    and   return  a   camera-ready   copy  of    the   test    to   the  district. 
In    systems   of    this   type,    scoring    services   are  usually  provided   by   the  test 
production  center. 


Districts  had  two  major  objections  to  this  system:   it  seemed  un- 
necessarily cumbersome,  and  it  appeared  likely  that  the  system  would  cost 
the  districts  money  to  participate.   The  Basic  Skills  Improvement  Policy 
has  a  limited  number  of  objectives,  and  those  objectives  appear  to  be 
readily  accepted  by  the  people  RMC  interviewed  as  universal  and  minimal. 
Given  that  everyone  would  be  using  the  same  objectives  on  their  test,  there 
was  real  question  as  to  how  "customized"  the  total  test  would  be.   If  the 
people interviewed had felt that they needed more individualization in their
tests, it is likely that they would have been more receptive to such a system.
A computerized system was not foreign to most of the
people interviewed; they were familiar with commercial efforts in this area
such  as  SCORE  and  ORBIT.   Both  SCORE  and  ORBIT  generate  customized  tests  from 
a  computerized  item  bank.   Their  hesitancy  to  support  a  computerized  system 
was  specific  to  this  particular  application.   They  felt  that  for  Basic  Skills 
objectives,  such  a  system  was  overly  elaborate. 

It  was  clear  that  the  idea  of  having  a  system  that  the  districts  would 
have  to  pay  for  was  definitely  unpopular,  no  matter  how  little  the  costs. 
In  a  few  of  the  meetings,  the  first  question  asked  about  the  computerized 
system  was  whether  it  would  cost  to  participate.   Unless  the  Department 
could  support  such  a  system  on  its  own  budget  and  make  the  tests  available 
to  districts  with  no  direct  charges  being  paid  by  them,  this  system  could 
expect  to  have  many  strong  opponents. 

To a much lesser extent, there was some concern over whether a computerized
system would provide the Department with too much control over what
districts  were  doing.   Fears  about  such  control  were  expressed  by  only  a 
few,  and  their  fears  did  not  seem  deep-set.   Nonetheless,  such  concerns  were 
expressed  at  more  than  one  location. 

Some  districts  were  positive  about  such  a  system.   One  district  already 
had  made  a  commitment  to  use  SCORE,  and  thus  had  a  somewhat  positive  attitude 
toward  this  system.   Another  had  a  similar  system  operating  in  the  math  depart- 
ment of  their  high  school.   They  recommended  that  the  system  be  provided  to 
each  regional  center  in  the  state  and  operated  by  those  offices.   They  also 
saw  advantages  to  having  all  the  technical/statistical  aspects  of  testing 
being  provided  by  machine. 

On  balance,  however,  it  was  clear  that  most  districts  had  strong  objec- 
tions to  a  computerized  system.   While  such  a  system  might  meet  with  favor 
for  another  application,  there  seemed  to  be  little  need  for  such  sophistica- 
tion for  a  program  in  basic  skills. 

Issue #5: Should the Department provide comparative data?

The  issue  of  whether  the  Department  should  make  it  possible  for  districts 
to  compare  their  results  to  statewide  averages  was  one  on  which  opinion  was 
sharply  divided.   In  several  of  the  districts  RMC  visited,  there  was  a  strong 
feeling  that  not  only  should  there  be  comparable  data,  but  that  only  one  test 
should  be  given  statewide  and  one  standard  should  be  set  by  the  Department. 
These  people  felt  that  failure  to  do  so  would  weaken  the  Basic  Skills  Improve- 
ment Policy  and  make  it  unfair.   On  the  other  hand,  there  were  those  who  were 
deeply  concerned  that  no  comparisons  be  made.   Some  expressed  the  opinion  that 
they  would  be  happiest  with  a  system  that  prevented  any  comparability  of  results 
even  between  schools  in  the  same  district. 




Those  who  favored  comparability  were  divided  into  two  camps — those 
who  wanted  it,  and  those  who  were  willing  to  accept  it.   Those  who  wanted 
it  generally  saw  a  need  to  have  one  standard  across  the  state,  both  to 
focus  attention  on  the  program  and  to  equalize  the  standard  across  the 
state.   They  tended  to  believe  that  the  program  should  be  a  strong  force 
for  improving  education  in  Massachusetts,  and  felt  it  would  not  meet  this 
objective  if  each  district  could  establish  its  own  standard.   Those  who 
were  willing  to  accept  comparability  saw  it  happening  now  on  several  other 
programs.   They  felt  that  such  comparisons  had  been  made  so  frequently  that 
they  no  longer  had  much  impact  in  terms  of  helping  the  district  to  see  how 
it  was  doing.   Many  saw  a  growing  need  for  comparability  because  of  the  in- 
creased mobility  in  America,  i.e.,  since  students  were  moving  around  so  much 
now,  increased  standardization  in  all  aspects  of  education  was  taking  on  a 
new  importance.   An  opinion  was  expressed  that  it  seemed  dishonest  to  let  a 
student  think  he  had  achieved  minimum  competency,  and  then  to  have  him  find 
out  he  is  not  competent  in  his  new  community. 

On  the  other  hand,  some  people  opposed  a  common  standard  because  they 
felt  that  the  common  standard  would  be  set  too  low.   They  feared  that  the 
Department  would  establish  a  standard  that  was  appropriate  for  the  big  city 
districts,  and  that  their  own  students  would  find  such  a  standard  too  easy. 
More  opponents  of  comparability,  however,  were  just  concerned  with  the  basic 
issue  that  comparisons  of  student  results  would  lead  to  unfair  judgments  of 
teacher  efforts.   These  people  felt  that  such  comparisons  would  lead  to 
pressures  to  improve  test  scores  that  would  be  counterproductive. 

RMC believes that there are two types of comparisons here that should
be considered: comparisons that enable one to compare the difficulty of
one item to another (or one test to another), and comparisons that permit
one to compare the score of a student to those of students throughout Massachusetts.
It  was  not  until  RMC  had  visited  a  few  sites  that  the  importance  of  this 
distinction  became  clear.   After  the  issue  was  presented  as  two  separate 
issues,  it  was  evident  that  there  was  much  more  support  for  the  former  type 
of  comparison  than  for  the  latter. 

Issue  #6:   Should  the  Department  provide  an  item  writing  service  for  school 
districts? 

Regardless  of  the  system  which  the  state  chooses  to  provide  tests,  it  can 
be  expected  that  there  will  be  objectives  that  some  people  will  want  measured 
for  which  the  Department  has  provided  no  items.   As  one  means  of  assistance 
to  districts,  it  was  suggested  that  the  Department  offer  an  item-writing 
service  to  school  districts,  either  by  hiring  item  writers  or  by  maintaining 
a  collection  of  consultants  who  could  be  called  in  to  do  work  as  needed. 

Among  the  alternatives  raised  by  this  study,  this  alternative  received 
the  most  consistently  negative  response.   First  of  all,  most  people  felt 
that  few  districts  would  write  additional  objectives.   The  judgment  was  that 
the  Basic  Skills  objectives  produced  were  pretty  much  a  complete  set  of  what 
would  be  measured.   In  addition,  it  was  clear  that  such  a  system  would  be 
cumbersome. In order to have an outside agency write test questions that meet
a specific demand, a district would have to communicate very specifically
what  it  wanted  for  questions.   By  the  time  the  district  could  effectively 
communicate  its  request  it  probably  would  have  been  easier  for  them  to 
write  the  questions  themselves. 

In  short,  while  such  a  system  might  have  some  merit,  in  certain  applica- 
tions, it  would  be  expensive,  cumbersome  and  largely  unused  for  the  Basic 
Skills  Improvement  Policy. 

Issue #7: Should the Department assess competence in writing via a holistically
scored writing sample, an analytically scored writing sample, or both?

Among  the  districts  sampled  in  this  study,  the  issue  of  how  writing 
samples  should  be  scored  elicited  an  extremely  consistent  response.   Virtually 
everyone agreed that writing tests must be scored holistically. Among those
who  had  attended  the  holistic  scoring  workshops,  endorsement  of  this  method 
was  enthusiastic  and  unanimous.   There  was  also  unanimity,  however,  in  the 
feeling  that  analytic  scoring  methods  would  be  needed  to  diagnose  problems 
among  students  who  failed  to  meet  the  minimum  standard.   Several  districts 
called upon the state to develop analytic scoring procedures for this purpose,
preferably in the form of a checklist. These districts also desired training
in  using  analytic  methods.   The  only  negative  comment  regarding  the  use  of 
holistic  scoring  was  that  it  made  standard  setting  difficult. 

Summary.   The  feedback  obtained  from  the  eight  school  districts  is  summarized 
as  the  following  points: 

1.  Tests  that  are  longer  than  those  currently  available  would  be  viewed 
generally  as  redundant  and  unnecessary. 

2.  Publication  of  more  testing  materials  by  the  state  could  be  perceived 
as  an  attempt  to  influence  local  curriculum. 

3.  Very  few  districts  perceived  a  need  for  longer  tests. 

4.  Most  districts  would  have  little  objection  to  a  longer  test  being 
published  so  long  as  a  short  test  were  directly  available. 

5.  Multiple  forms  of  the  test  will  need  to  be  made  available  each  year. 

6.  No  more  than  two  equivalent  forms  per  year  will  be  necessary. 

7.  Most  districts  view  an  item  pool  as  too  sophisticated,  complex  and 
expensive. 

8.  If  districts  are  to  construct  tests  from  a  pool,  they  will  be  concerned 
about  comparability  of  forms,  both  across  years  and  across  districts. 

9.  Little  flexibility  in  test  construction  is  needed  for  a  basic  skills 
program. 


18 


10. There was strong support for a system that had fixed tests provided
for those who wanted them, and a pool available for those who wanted
to either modify the fixed test or develop their own.

11. A computerized system would be viewed as unnecessarily cumbersome for
a basic skills program.

12. A system requiring direct out-of-pocket costs would be very unpopular,
no matter how modest the costs.

13. There are two types of comparability data that might be provided:
item difficulty and student achievement. There is great diversity of
opinion about both, but more support for the former type than the
latter.

14. An item writing service has little to no appeal for use in basic
skills testing.

15. Among interviewees, those who have attended the holistic scoring
workshops endorse the method wholeheartedly.

16. Analytic scoring would be of value in diagnosing writing problems.




V.   ONE  ADDITIONAL  ISSUE 

At  all  the  meetings  held  with  school  districts,  RMC  emphasized  that 
the  issues  it  had  raised  might  not  be  a  complete  set  of  the  important 
issues.   District  staff  in  attendance  at  the  meetings  were  encouraged  to 
raise  any  additional  issues  of  concern. 

At  most  meetings,  no  additional  issues  were  raised,  and  the  discussion 
was  limited  to  the  issues  RMC  had  brought  to  the  meeting.   In  two  districts, 
however,  one  additional  issue  was  raised — what  could  be  done  about  the  prob- 
lem of  having  to  give  a  test  several  times  in  an  area,  either  because  several 
adjacent  districts  administered  the  test  on  different  dates,  or  because  a 
district  had  to  administer  a  test  several  times  to  accommodate  absentees. 

The  issue  is  one  of  legitimate  concern.   If  students  can  achieve  higher 
scores  on  the  Basic  Skills  test  provided  by  the  Department  by  finding  out  the 
questions  (and  answers)  from  those  who  were  tested  previously,  the  credibility 
of  the  test  results  will  suffer. 

To  some  extent,  this  issue  is  less  important  than  it  might  have  been 
had the program been established so that the consequences to the students
for unsatisfactory performance were greater (such as denial of a diploma).
However,  given  the  current  scope  of  the  Basic  Skills  Improvement  Policy, 
it  is  difficult  to  imagine  that  students  would  go  too  far  out  of  their  way 
to  find  out  the  questions  before  taking  the  test. 

Nonetheless,  the  problem  is  real,  and  is  not  solved  readily.   None 
of  the  systems  RMC  proposed,  with  the  possible  exception  of  the  computer- 
ized system,  adequately  addressed  it.   The  only  complete  solution  would  be 
to  make  available  to  each  district  enough  equivalent  tests  to  adequately 
cover  all  the  administrations  it  planned  to  conduct.   This  solution  is  not 
feasible,  either  economically  or  administratively,  under  any  of  the  proposed 
systems. 

The fact that none of the proposed systems could be used to solve this
problem was a factor in RMC's decision to reject all the originally proposed
systems. The system proposed in this report solves the problem by allowing
each district to construct as many tests as it deems necessary to insure that
each group of students being tested is presented with a fresh set of test
questions.




VI.   FEATURES  OF  A  FEASIBLE  SYSTEM 

In  Section  IV,  a  series  of  issues  were  discussed,  and  the  feedback 
received  from  districts  on  these  issues  was  reported.   The  purpose  of  this 
section  will  be  to  synthesize  the  feedback  obtained  from  the  districts  and 
the  concern  expressed  by  the  Department  into  a  series  of  essential  and 
desirable  features. 

There  appear  to  be  several  features  which  any  system  proposed  for  this 
application  in  Massachusetts  should  have.   Some  of  these  features  are  con- 
sidered essential;  RMC  predicts  that  a  system  missing  any  of  the  essential 
features  would  prove  to  be  unsatisfactory.   The  other  features  are  considered 
desirable;  systems  that  include  these  features  would  be  preferred  over 
systems  that  do  not. 

It  will  be  shown  in  the  next  section  of  this  report  that  few  of  the 
systems  RMC  proposed  at  the  beginning  of  this  study  (provided  in  Appendix  A) 
have  all  the  essential  features,  and  none  of  them  has  both  all  the  essential 
and  all  the  desirable  features.   A  new  system,  one  which  contains  all  the 
features,  will  be  described  and  reviewed  in  the  following  section. 

Essential  features.   Following  are  the  essential  features  which  any  system 
must  have  in  order  to  be  judged  as  feasible: 

1.  The  basic  service  provided  by  the  system  must  be  available  at  no 
cost  to  each  district.   In  the  site  visits,  most  districts  had  an 
instant  and  strongly  negative  attitude  toward  any  system  which 
required  extra  costs,  even  if  those  costs  were  "nominal." 

2.  The  system  must  be  maintainable,  in  the  long  run,  by  existing  staff 
in the Bureau of Research and Assessment. This means that any system
must be manageable by Bureau staff of the current size (4 professionals),
and within their range of technical expertise.

3.  Two  equivalent  forms  of  the  test  must  be  provided  each  year.   Dis- 
tricts argued  effectively  that  they  needed  more  than  one  form  of 
the  test  each  year,  but  there  appeared  to  be  little  need  for  more 
than  two. 

4.  Districts  must  be  able  to  use  the  system  as  it  comes  with  little 
extra  effort.  Many  districts  do  not  have  the  resources  to  create 
tests  from  materials  supplied  by  the  Department.  While  some  small 
amount of administrative detail could be accommodated (such as reprinting
copies of the test provided by the Department), a system
that  required  committees  of  teachers  to  be  empaneled,  for  example, 
would  place  an  undue  burden  on  small  school  districts. 

5.  School  district  personnel  must  be  able  to  understand  the  system. 
The  actual  essential  feature  is  that  the  public  will  accept  the 
system,  but  public  acceptability  will  not  be  known  before  the  system 
is  in  place.   However,  it  is  fairly  certain  that  the  public  will  not 
accept a system it does not understand. The assumption being made here
is that a system which is understood by school district personnel will
also be understood, and accepted, by the public at large.

Desirable  features.   Following  are  features  which  a  system  should  have  in 
order  to  make  it  preferable: 

1.  The  tests  provided  should  be  short.   As  noted  in  the  discussion  of 
Issue  1,  there  are  several  reasons  to  favor  a  short  test:   minimally 
competent  students  are  unnecessarily  and  unfairly  taxed  by  a  long 
test;  there  is  little  perceived  need  for  a  long  test;  the  state  could 
not  produce  a  single  long  test  that  would  meet  the  varied  needs  of 
districts;  and  a  long  test  could  be  perceived  as  an  attempt  by  the 
Department  to  dictate  curriculum. 

2.  The  system  should  be  flexible  enough  to  accommodate  future  modifica- 
tions in  the  Basic  Skills  Improvement  Policy.   Many  of  the  people 
whom  RMC  interviewed  felt  it  was  likely  that  there  would  be  some 
changes  in  the  Basic  Skills  Improvement  Policy  over  the  next  few 
years.   Two  changes  that  were  expected  by  many  were  an  expansion  of 
the  Department's  effort  to  provide  assistance  into  the  early  and 
late  elementary  levels,  and  a  shift  toward  making  successful  comple- 
tion of  a  basic  skills  test  a  high  school  graduation  requirement. 
While  no  changes  in  the  Basic  Skills  Improvement  Policy  are  antici- 
pated before  the  1983-84  school  year,  it  would  be  desirable  for  a 
system  to  accommodate  changes  should  they  occur. 

3. The system should be flexible enough to permit districts to use the
tests  for  a  variety  of  purposes.   While  one  of  the  essential  features 
is  that  districts  must  be  able  to  use  the  system  as  it  comes,  some 
districts  have  the  desire  and  resources  to  use  the  testing  program  for 
more  than  the  minimals  of  basic  skills  testing.   For  example,  while 
most  districts  would  feel  no  need  to  develop  diagnostic  tests  related 
to  basic  skills,  some  would  like  to  do  so  if  supportive  materials  from 
the  Department  were  available  to  help  them.   Consequently,  other  things 
being  equal,  a  system  that  permits  such  flexibility  should  be  adopted 
over  one  that  does  not. 




VII.   FEASIBILITY  OF  PROPOSED  SYSTEMS 

As  mentioned  in  the  previous  section,  few  of  the  original  systems  RMC 
proposed  (detailed  in  Appendix  A)  have  all  the  essential  features  mentioned 
above,  and  none  has  all  the  essential  features  and  all  the  desirable  fea- 
tures.  The  purpose  of  this  section  will  be  to  describe  the  shortcomings 
of  each  of  the  systems  RMC  originally  proposed. 

The  reader  will  note  that  discussion  of  the  fifth  essential  feature 
(that  school  district  personnel  must  understand  the  system)  has  been  omitted 
from  consideration  in  this  section.   That  is  because  of  RMC's  intent  to 
test  any  conceivably  feasible  system  for  this  feature  at  the  workshops  planned 
under the contract. Since that feature was to be tested empirically, there
was  no  need  to  operate  on  conjecture.   If  a  system  were  to  fail  this  feature, 
that  would  be  known  after  the  workshops. 

System  A  —  MDE  publishes  a  single  short  test.   This  system  does  not 
have  the  third  essential  feature  (two  forms)  in  that  only  one  form  would  be 
provided  each  year.   In  addition,  the  system  lacks  the  flexibility  called  for 
in the second and third desirable features (future modifications and local
flexibility).

System  B  —  MDE  publishes  a  single  long  test.   This  system  fails  for  the 
same  reasons  as  System  A,  plus  the  length  of  the  test  violates  the  first 
desirable feature (short test).

System  C  —  MDE  publishes  multiple  forms  of  a  short  test.   This  system 
is  the  one  currently  provided  by  the  Department.   It  fulfills  all  the  essen- 
tial requirements,  and  could  be  simplified  by  reducing  the  number  of  equiva- 
lent forms  made  available  to  two  per  year.   The  system  does  lack  two  desirable 
features,  however,  in  that  it  is  not  flexible. 

System  D  —  MDE  publishes  multiple  forms  of  a  long  test.  While  this 
system  contains  all  of  the  essential  features,  it  contains  none  of  the  desir- 
able ones.   System  C,  providing  short  tests,  would  be  preferred  to  System  D. 

System  E  —  MDE  publishes  a  short  test  and  a  long  test.  While  this 
system  does  provide  two  tests  each  year,  the  short  one  would  just  be  a  subset 
of  the  items  in  the  long  one,  and  thus  would  not  fulfill  the  needs  of  those 
who need two tests each year (an essential feature). In addition, the system
would  fail  the  flexibility  requirements  of  the  desirable  features. 

System  F  —  MDE  publishes  multiple  forms  of  both  the  long  test  and  the 
short  test.   This  system  fails  the  second  essential  feature  (maintainable  by 
existing  staff)  since  a  system  involving  this  much  test  development  would  go 
far beyond the current resources of the Department. In spite of the high
costs,  this  system  would  have  no  more  flexibility  than  the  previous  five. 

System  G  —  MDE  publishes  a  pool  of  items.   This  system  violates  the 
third and fourth essential features (two forms and no extra effort). RMC
found that the distribution of a pool of items, while originally believed to
be one of the better systems, is not feasible because too many districts
would not have the resources to implement it. These are the same districts
who  could  not  afford  to  develop  a  test  on  their  own;  thus,  this  system 
would  provide  the  least  help  to  those  who  need  it  the  most. 

System  H  —  MDE  publishes  a  single  short  test,  but  provides  additional 
items  on  request.   As  noted  in  the  discussion  on  the  issues,  it  appears  that 
any  system  the  Department  would  establish  to  write  additional  items  on  request 
would  not  be  feasible.   The  system  would  be  expensive  and  unnecessarily  burden- 
some, since  by  the  time  a  district  had  developed  a  request  for  item  writing 
that would communicate effectively, the item could have been written. This
system adds a layer of complexity to the systems described above and makes them
much more expensive and time-consuming while adding little additional flexibility.

System  I  —  MDE  publishes  a  single  long  test,  but  provides  additional  items 
on request. This system is judged to be not feasible for all the reasons cited
under System H, with the additional negative aspect that the test provided
is a long one.

System  J  —  MDE  publishes  separate  tests  for  each  district.  While  in  some 
ways, a system of computer-generated tests seems most appealing and would be a
preferred  system  for  many  applications,  it  is  judged  to  be  not  feasible  for 
this  particular  situation.  While  the  total  costs  of  this  system  might  be  less 
than  many  of  the  others  proposed,  it  would  be  necessary  for  some  direct  costs 
to  be  assumed  by  each  participating  district,  unless  the  Department  established 
the  system.   These  costs  would  violate  the  first  essential  feature. 

Perhaps  more  importantly,  a  computerized  system  would  require  each  dis- 
trict to  make  a  determination  of  what  its  test  should  be  like.  While  such 
flexibility  makes  this  system  very  attractive  at  first  glance,  it  makes  the 
system violate the fourth essential feature (no extra effort). In addition,
Department  staff  feel  that  they  could  not  maintain  such  a  system.  While 
there  are  many  applications  for  which  RMC  would  view  a  computerized  test 
construction  system  as  an  ideal,  it  appears  that  such  a  system  would  not  be 
feasible  for  the  Basic  Skills  Improvement  Policy. 

The  information  presented  above  is  summarized  in  Table  1.  A  checkmark 
has  been  placed  where  a  system  possesses  a  feature,  and  a  blank  where  it  does 
not.   As  mentioned  in  the  introduction,  the  fifth  essential  feature  (that 
school  district  personnel  must  be  able  to  understand  the  system)  is  not  in- 
cluded here,  because  the  understandability  of  any  system  is  open  to  conjecture, 
and  RMC  tested  that  feature  at  the  workshops. 

Of  the  systems  originally  proposed,  only  one  contains  all  the  essential 
features  necessary  for  a  feasible  system  —  System  C.   System  C  basically  is 
the  system  currently  in  use  by  the  Department  of  Education.   However,  System 
C  lacks  two  of  the  desirable  features,  both  of  which  are  related  to  flexibility. 
It  appeared  that  if  a  system  could  be  proposed  that  offered  all  the  features  of 
System  C  and,  in  addition,  provided  substantial  flexibility  at  little  or  no 
extra  cost,  it  would  be  preferable  to  System  C.   Before  proceeding  with  the 
workshops,  RMC  used  the  information  it  had  collected  to  this  point  to  form  the 
selection  of  a  system  that  appeared  to  be  superior  to  any  of  those  originally 
proposed.   Those  initial  decisions  are  presented  in  the  next  section  of  this 
report. 
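
The screening logic behind this comparison can be sketched in a few lines of Python. This is an illustrative sketch only, not part of RMC's procedure: the feature names are invented labels for the essential and desirable features of Section VI, and the LACKS table encodes only the shortcomings stated in the system-by-system discussion above. Systems H and I are left out because the text judges them not feasible without naming the individual features they fail. Under this encoding System D also passes the essential-feature screen, as its description states, but it has none of the desirable features, so System C ranks first.

```python
# Hypothetical labels for the features listed in Section VI.
ESSENTIAL = {"no_cost", "maintainable", "two_forms", "little_extra_effort"}
DESIRABLE = {"short_tests", "bsip_flexible", "district_flexible"}

# Features each system LACKS, taken from the discussion in this section.
# Systems H and I are omitted: the text does not itemize their failures.
LACKS = {
    "A": {"two_forms", "bsip_flexible", "district_flexible"},
    "B": {"two_forms", "short_tests", "bsip_flexible", "district_flexible"},
    "C": {"bsip_flexible", "district_flexible"},
    "D": {"short_tests", "bsip_flexible", "district_flexible"},
    "E": {"two_forms", "bsip_flexible", "district_flexible"},
    "F": {"maintainable", "bsip_flexible", "district_flexible"},
    "G": {"two_forms", "little_extra_effort"},
    "J": {"no_cost", "maintainable", "little_extra_effort"},
}


def passes_essential_screen(lacks):
    """Return the systems that lack none of the essential features."""
    return sorted(s for s, missing in lacks.items() if not (missing & ESSENTIAL))


def desirable_count(system, lacks):
    """Count how many desirable features a system possesses."""
    return len(DESIRABLE - lacks[system])


def preferred_system(lacks):
    """Among systems passing the screen, pick the one with most desirable features."""
    candidates = passes_essential_screen(lacks)
    return max(candidates, key=lambda s: desirable_count(s, lacks))


if __name__ == "__main__":
    print(passes_essential_screen(LACKS))
    print(preferred_system(LACKS))
```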


Table 1

Features which are satisfied by
the originally proposed systems.

                                      Essential                                               Desirable
System                                No Cost       Maintainable  Two Forms     Little Extra  Short Tests   BSIP          District Use
                                                    by Bureau                   Effort                      Flexible      Flexible

A. Single short test                  ✓             ✓                           ✓             ✓
B. Single long test                   ✓             ✓                           ✓
C. Multiple short test                ✓             ✓             ✓             ✓             ✓
D. Multiple long test                 ✓             ✓             ✓             ✓
E. Single short and long test         ✓             ✓                           ✓             ✓
F. Multiple short and long            ✓                           ✓             ✓             ✓
G. Item pool                          ✓             ✓                                         ✓             ✓             ✓
H. Short test and additional items    [three checkmarks; column positions illegible]
I. Long test and additional items     [two checkmarks; column positions illegible]
J. Computerized system                                            ✓                           ✓             ✓             ✓



VIII.   SUMMARY  OF  INITIAL  DECISIONS 

By January, 1980, some aspects of the final system RMC would be proposing
had been determined, and it was clear what the remaining questions were.
The purpose of this section will be to describe the
known aspects and to specify the questions that remained at this point.

Item  Pool.   It  was  clear  that  the  Department  should  begin  to  develop  its 
test  from  an  item  pool.   It  already  had  developed  over  200  questions  in  both 
reading  and  math,  and  under  the  system  it  has  been  employing,  would  be  likely 
to  develop  another  500  or  so  in  each  content  area  over  the  next  5  years. 
Developing  each  of  the  tests  from  scratch  each  year  would  be  a  costly  process. 
There  is  no  question  that  developing  a  pool  of  500  to  1,000  items  at  the 
beginning  of  the  five  year  cycle,  and  then  drawing  tests  from  that  pool  each 
year,  would  be  no  more  expensive  than  the  current  procedure  and  most  likely 
would  be  far  less  costly. 

The  first  reason  why  this  procedure  would  cost  no  more  is  that  the 
Department  could  investigate  the  possibility  of  purchasing  an  item  pool  that 
has  been  constructed  by  some  other  agency.   Because  of  the  nature  of  the 
Basic  Skills  objectives,  it  is  likely  that  other  pools  developed  for  the  same 
purposes  would  measure  most,  if  not  all,  of  those  objectives.   Since  agencies 
that  develop  pools  are  content  to  recover  just  part  of  their  costs  when  they 
sell  those  pools  to  other  agencies,  the  costs  of  purchasing  usually  are  far 
less  than  those  of  development.   Some  agencies  even  permit  use  of  their  item 
pool  without  charge. 

This  is  not  meant  to  imply  that  the  only  costs  of  acquiring  a  pool  for 
use  in  Massachusetts  would  be  the  initial  purchase  price.   The  items  would 
have  to  be  reviewed,  revised  or  rejected  as  appropriate,  and  administered 
to  a  sample  of  Massachusetts  students.   If  many  items  in  the  initial  pool  are 
judged  to  be  unsatisfactory,  the  total  costs  of  this  process  could  be  more 
than  the  costs  of  writing  all  the  items  from  scratch.   But  these  costs  can  be 
estimated  with  a  fair  degree  of  precision  in  advance,  and  if  the  purchase  and 
revision  of  a  pool  were  going  to  cost  more  than  writing  all  the  items  from 
scratch,  that  approach  would  be  rejected.   Thus,  the  purchase  of  a  pool  could 
not  be  more  expensive  than  writing  items  from  scratch;  the  reality  of  the 
situation  is  that  it  is  likely  to  cost  far  less. 

The other aspect of a pool that could make it less costly than developing
a new set of items each year is the efficiency that can be built in by
doing all of the field and pilot testing one time rather than five times.
Even  if  all  the  items  must  be  written  from  scratch  because  purchase  of  a  pool 
is  not  feasible,  it  would  cost  less  to  write  and  test  several  hundred  at  one 
time  than  to  write  a  hundred  or  so  each  year  for  5  years.   Thus,  it  was  clear 
to  RMC  that  one  recommendation  must  be  that  the  Department  generate  all  of  its 
tests from a pool, rather than writing new tests each year. A second recommendation
would be that the Department pursue the feasibility of purchasing a pool
before  it  resumes  further  item  writing  activities. 

Test  development.   It  also  was  clear  that  the  Department  should  provide 
two  parallel,  equivalent  forms  of  the  test  each  year.   These  two  forms  should 
be developed by drawing items systematically from the item pool. Camera-ready
copies of these tests would be provided to all districts that indicated
they would be using the State Test to assess their students.
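
One way to read "drawing items systematically" is sketched below in Python. The report does not specify the drawing rule or the item statistics, so everything here is an assumption made for illustration: the item fields (id, objective, and p_value, taken to mean the proportion of field-test students answering correctly), the function name build_parallel_forms, and the A-B-B-A alternation used to balance difficulty. Within each objective the items are ranked by difficulty, an evenly spaced subset is taken, and neighboring items are dealt to Form A and Form B in alternation so that the two forms cover the same objectives at similar average difficulty.

```python
from collections import defaultdict


def build_parallel_forms(pool, items_per_objective):
    """Split a pool into two forms matched on objective and difficulty.

    pool: list of dicts with hypothetical keys "id", "objective", "p_value".
    Returns (form_a, form_b), each a list of item dicts.
    """
    by_objective = defaultdict(list)
    for item in pool:
        by_objective[item["objective"]].append(item)

    form_a, form_b = [], []
    needed = 2 * items_per_objective
    for objective in sorted(by_objective):
        ranked = sorted(by_objective[objective], key=lambda it: it["p_value"])
        if len(ranked) < needed:
            raise ValueError(f"objective {objective!r} has too few items")
        # Evenly spaced picks so both forms span the full difficulty range.
        picks = [ranked[k * len(ranked) // needed] for k in range(needed)]
        for position, item in enumerate(picks):
            # A-B-B-A order: A, B, B, A, A, B, B, A, ...
            target = form_a if position % 4 in (0, 3) else form_b
            target.append(item)
    return form_a, form_b


def mean_p_value(form):
    """Average proportion correct across the items on a form."""
    return sum(item["p_value"] for item in form) / len(form)


# Made-up pool for illustration: two objectives, four items each.
pool = [
    {"id": f"{obj}-{n}", "objective": obj, "p_value": p}
    for obj in ("R1", "M1")
    for n, p in enumerate((0.5, 0.6, 0.7, 0.8), start=1)
]
form_a, form_b = build_parallel_forms(pool, items_per_objective=2)
```

In this made-up pool each form receives two items per objective, and the easiest and hardest item of each objective land on one form while the two middle items land on the other.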

Other  issues.   In  January,  it  still  was  unclear  whether  the  Department 
should  release  the  pool  to  districts  for  their  use,  and  if  so,  what  types 
of  statistics  should  be  provided  along  with  the  pool  so  that  district  per- 
sonnel could  use  it  effectively.   In  order  to  gather  information  on  how  these 
issues  should  be  resolved,  a  series  of  workshops  were  planned.   These  work- 
shops, and  the  information  that  was  acquired  from  them,  are  discussed  in  the 
next  section  of  this  report. 




IX.   THE  WORKSHOPS 

Purpose 

Throughout  the  contract  RMC  has  attempted  to  define  a  feasible  system 
in  terms  of  local  district  perceptions  as  well  as  its  technical  characteris- 
tics.  It  was  through  interviews  with  local  districts  that  we  arrived  at  a 
set  of  essential  and  desirable  features,  and  it  was  through  a  series  of 
regional  workshops  that  we  chose  to  investigate  two  questions  about  the 
proposed  system  which  were  left  unresolved  in  January. 

The  first,  and  most  critical,  of  these  questions  was  whether  the  item 
pool  should  be  maintained  securely  within  the  Department  for  its  use  in 
constructing  equivalent  test  forms,  or  should  be  released  to  local  districts 
for  their  use  in  constructing  tests  equivalent  to  the  state  test.   The 
answer  to  that  question  has  many  implications  for  the  way  the  item  pool  would 
be  designed.   For  example,  if  the  pool  is  maintained  securely  and  districts 
are  presented  only  with  the  tests  they  will  use,  each  item  in  the  pool  will 
have to meet with complete acceptance by all districts using the state's tests.
If  the  pool  is  released,  items  which  some  districts  felt  were  not  usable  for 
their  students  could  remain  in  the  pool  for  use  by  others.   If  the  pool  is 
released,  it  must  also  be  larger  than  if  it  were  maintained  securely.   In 
order  to  satisfy  varying  district  needs  the  pool  would  have  to  contain  a 
broader  range  of  items  within  an  objective,  both  in  content  and  difficulty, 
and  it  would  have  to  contain  more  items  per  objective. 

The  second  major  question  to  be  addressed  by  the  workshops  was  whether 
or  not  school  personnel  would  be  able  to  understand  a  system  which  called  for 
them  to  use  statistics  provided  with  the  items  in  the  pool  to  develop  their 
own  form  of  the  state  test.   Although  RMC  had  included  as  an  essential 
feature  the  requirement  that  a  system  be  understandable,  there  was  no  way  in 
January  that  the  proposed  system  could  be  evaluated  on  that  dimension.   It 
is  one  thing  to  describe  to  someone  how  an  item  pool  might  be  used,  and  quite 
another  for  that  person  to  actually  construct  and  interpret  a  test  from  the 
pool.   RMC  insisted  on  withholding  judgment  about  how  understandable  such  a 
system  would  be  until  a  large  sample  of  districts  had  received  training  and 
hands-on  experience  with  a  simulated  item  pool. 

Content 

The  question  which  we  felt  most  workshop  participants  would  be  asking 
is  "what  can  an  item  pool  do  for  me?"  It  was  important,  therefore,  that  the 
workshop  be  organized  around  some  commonly  held  local  district  concerns. 
These anticipated concerns were divided into two major categories: (1)
questions  related  to  using  the  item  pool  to  construct  tests,  and  (2)  questions 
related  to  interpretation  of  the  results  obtained  from  these  tests.   Within 
each  category  RMC  generated  specific  questions  which  local  districts  might 
wish to answer if the item pool were released for their use. The questions in
the test construction category were: (1) how can you replace items in the
state test, and (2) how can you change the length of the state test. Questions
which fell in the test interpretation category were: (1) how can you compare
the difficulty of one test to another, (2) how can you compare the average
score of your students to that of students statewide, (3) if all twelfth grade
students in the state had taken your test, what percent would have passed your
standard, (4) how can you make this year's standard equivalent to last
year's standard if you change the test, (5) how can you determine which
content areas are your district's relative strengths and weaknesses, and
(6) if the test has changed since last year, how can you determine the
content areas in which your district's scores have changed.

The   extent    to   which   these  questions   can  be   answered   depends   on   the 
type   of    item    statistics   which  are   supplied    to    the  user   of    the    item   pool. 
When  very  crude  measures   of    item  difficulty   are  provided,    very   few   of    the 
questions can be addressed. When more sophisticated measures of item difficulty
are provided, all of the questions can be addressed. RMC's strategy
in   the  workshops   was    to   present    item   statistics   of    increasing    sophistication 
using    exercises   from  a    simulated    item  pool.      In   this  way,    it   was   hoped    that 
local  district   personnel   would   discover    the   level   of    sophistication  which 
seemed  most   appropriate   for    their   district,    both   in   terms   of    the   questions 
which  could    be   answered   and    the   statistical   procedure   involved    in  answering 
them. 

Three   item    statistics  were  chosen   for  discussion:      difficulty  category 
values,    P-values,    and   Rasch   scale  values.      Difficulty   category  values,    which 
we  called   DCVs,    represented    the  lowest    level   of    sophistication.      Within   the 
range   of   difficulty  chosen  for    the   item   pool,    items  were  classified    into 
six  difficulty  categories.      Items   in   the   easiest   category  were  assigned   a 
value  of    1;    items   in  the  hardest   category  were  assigned   a  value  of   6.      Each 
category represented about one-half of a logit on the Rasch scale. DCVs,
therefore,    were   actually   simplified    (one-digit)    Rasch   scale  values. 
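
As an illustration only, the relationship between a Rasch scale value and a DCV can be sketched in a few lines of Python. The report states only that there were six categories of about one-half logit each; the lower edge of the pool's difficulty range (POOL_MIN_LOGIT) and the function name are assumptions made for this sketch, not values used in the workshops.

```python
import math

# Assumed for illustration: the report does not give the pool's actual range.
POOL_MIN_LOGIT = -1.5   # hypothetical lower edge of the pool's difficulty range
CATEGORY_WIDTH = 0.5    # "about one-half of a logit" per category
NUM_CATEGORIES = 6      # six difficulty categories


def difficulty_category(rasch_logit: float) -> int:
    """Map a Rasch difficulty in logits to a DCV from 1 (easiest) to 6 (hardest)."""
    index = math.floor((rasch_logit - POOL_MIN_LOGIT) / CATEGORY_WIDTH)
    # Items outside the assumed range are clamped into the end categories.
    return min(max(index, 0), NUM_CATEGORIES - 1) + 1
```

Under these assumed bounds, an item at 0.0 logits would fall in category 4, and any item below -1.0 logits would fall in category 1.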

The  second   type  of    item   statistic   presented   was  the  P-value,    or   the 
proportion  of    students   that  answered   the   item  correctly.      For    instance,    a 
P-value  of    .83   would  mean   that  when   the   item  was  administered   to   a   sample 
of    students   83%   of    them   answered    it   correctly.      Since  P-values   are  already 
used   by   the  Department,    they   represented    the   status   quo    in   terms   of    the 
questions  which  could    be  answered.      What   workshop   participants   gained   was 
merely   some  practical    experience   in  applying   P-values   to    the  use  of    an   item 
pool. 
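
A minimal sketch of the P-value calculation follows. The function name and the True/False coding of responses are assumptions for illustration; the sample reproduces the .83 example given above.

```python
def p_value(responses):
    """Proportion of students who answered an item correctly.

    responses: one True/False entry per student (True = correct).
    """
    if not responses:
        raise ValueError("at least one response is required")
    return sum(1 for r in responses if r) / len(responses)


# The example from the text: 83 of 100 sampled students answer correctly.
sample = [True] * 83 + [False] * 17
item_p = p_value(sample)
print(f"P-value: {item_p:.2f}")
```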

The third statistic presented was the Rasch scale value, which we called the
R-value. R-values represented the highest level of sophistication.
However,   no  attempt  was  made  to   introduce  participants   to   latent   trait   theory 
or   the  process  of    item  calibration.      R-values  were  merely  presented   as 
another   type  of    item   statistic  whose  technical  properties  permitted   a  wider 
range  of   questions  to   be  answered.      The  R-value   scale  which  we   employed    in 
the  workshops  rated    item  difficulty   on  an  open-ended    three-digit    scale.      Thus, 
an    item  with  an  R-value   of    150  would   be   easier   than   an   item  with  an  R-value 
of    125. 

The  workshop,    as    initially  prepared,    led   participants    through  three   sets 
of exercises, one set for each of the three types of item statistics. These
exercises   required    participants   to   assemble   tests   from   the    item  pool   and 
walked    them   through   the   steps   which  would   be  necessary   to    interpret    the   tests 
they  had   constructed.      Participants  were   required    to   construct    their    tests 
in   such  a   way   that    they  would   be   equivalent    to    the   two   forms  released   by   the 
Department    each  year.      RMC   felt    it   was    important    to   have  participants   actually 
use   each   of    the   three   statistics   before  deciding   which  of    the   three  best    suited 
their   needs. 




It  was  decided  that  once  participants  had  been  through  the  exercises, 
a  questionnaire  would  be  used  to  solicit  feedback  on  whether  or  not  the 
item  pool  should  be  released  to  local  districts,  the  type  of  item  statistics 
which  should  be  provided,  and  the  importance  of  being  able  to  answer  the 
questions  around  which  the  workshop  was  organized. 

Revision 

Before  presenting  the  workshop  to  local  districts,  RMC  presented  it 
to  the  Bureau  of  Research  and  Assessment,  the  Review  Committee  and  the 
Regional  Center  Staff  for  their  comments.   As  it  turned  out,  the  comments 
of  these  people  were  extensive  and  a  major  revision  of  the  workshop  ensued. 
Foremost  among  their  concerns  was  overemphasis  on  test  construction 
exercises.   As  originally  designed,  the  workshop  required  participants  to 
use  the  simulated  item  pool  without  first  explaining  the  rationale  behind 
having  an  item  pool  or  providing  any  background  for  the  eight  questions  which 
were  being  addressed.   Reviewers  also  reacted  unfavorably  to  the  amount  of 
repetitious  paperwork  required  of  participants  in  the  exercises,  and  suggested 
changes in the simulated guidelines for test construction. Specific
recommendations were also made regarding the format of certain workshop materials.

In response to these concerns RMC dropped all but two of the test
construction exercises from the workshop and added an introduction which explained
the  use  of  item  pools  in  some  depth,  including  the  rationale  behind  the 
feasibility  study,  advantages  and  disadvantages  of  releasing  the  pool,  and  an 
introduction  to  the  series  of  eight  questions  which  we  would  be  addressing. 
Participants  still  went  through  the  hands-on  simulation  of  constructing  a  test 
using DCVs, but the exercise was not repeated for P-values and R-values.
Instead of working through exercises, participants followed concrete examples
carried out by the presenter. Revisions were also made in the materials.
Guidelines  were  simplified  and  item  statistics  were  presented  in  tables  rather 
than  in  the  item  pool  itself. 

RMC  is  extremely  grateful  to  the  Bureau,  the  Review  Committee  and  the 
Regional Center Staffs for their feedback on the initial version of the
workshop. In its revised form the workshop was extremely well received, and a
large  share  of  the  credit  for  that  success  belongs  to  those  who  shared  their 
reactions  with  us  during  that  initial  presentation. 

Presentation 

The  workshop  was  presented  at  each  of  the  six  regional  education  centers 
around  the  state  according  to  the  following  schedule: 

Pittsfield  March  4,  1980 

Springfield  March  5,  1980 

West  Boylston  March  11,  1980 

Lakeville  March  12,  1980 

Cambridge  March  18,  1980 

North  Reading  March  19,  1980 




Invitations  were  mailed  by  RMC  to  all  Superintendents  and  Basic  Skills 
Coordinators  in  the  state  with  instructions  to  reply  to  their  Regional 
Linkers.   Regional  Linkers  arranged  workshop  facilities  at  each  location. 

Each  workshop  was  introduced  by  a  staff  member  from  the  Bureau  of 
Research  and  Assessment.   Dr.  Richard  Hill  presented  the  introduction  to 
item pools and the portions of the workshop which dealt with test
interpretation. Dr. Richard Lyczak presented portions dealing with test construction.
The  workshops  were  attended  by  small  but  intensely  interested  audiences. 
The  questions  asked  of  the  presenters  were  of  extremely  high  quality  and 
participation  was  generally  enthusiastic.   There  was  one  exception  to  this 
pattern. The mood in one of the six workshops tended to be on the
antagonistic side. Comments tended to be negative and there seemed to be less
interest  in  the  material  being  presented.   Except  for  the  fact  that  the  room 
in  which  this  workshop  was  held  was  awkwardly  large  and  impersonal,  we  have 
no  way  of  accounting  for  this  single  aberrant  case. 

Although  there  was  some  attrition  during  the  afternoon  session,  the 
vast  majority  of  workshop  participants  stayed  to  fill  out  the  questionnaire 
at  the  end  of  the  session.   Among  those  who  did  leave  early,  most  excused 
themselves  on  account  of  other  obligations  and  took  with  them  copies  of  the 
afternoon  materials. 




X.      FEEDBACK  FROM  THE  WORKSHOPS 

At the end of each workshop, participants were given the
questionnaire reproduced in Appendix C of this report. Results of the
questionnaire  are   summarized    in  Tables   2   and   3.      Complete  results  are 
reported    in  Appendix  D. 

There  were  a  total   of   205  people  from  133   different  districts  who 
attended   the  workshops.      Since  a   total  of   157    questionnaires  were  turned 
in  there  was  an  overall  average  response  rate  of   77%.      The  response  rates 
ranged  from  a  low  of   49%   to  a  high  of    94%. 
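
The overall response rate can be checked directly from the counts above. This is only an arithmetic check; the variable names are chosen for illustration.

```python
attendees = 205   # people who attended the workshops
returned = 157    # questionnaires turned in

response_rate = returned / attendees
not_returned = attendees - returned

print(f"Response rate: {response_rate:.0%}")
print(f"Attendees without a questionnaire: {not_returned}")
```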

The  48   people  who  failed   to  turn   in  a  questionnaire  most  frequently 
were  those  who  did   not  attend   the  second  half   of    the  workshop.      The  first 
half   of    the  workshop  ran  from   9:30  until   11:30.      Then  a   lunch  break  was   taken, 
with  the   second  half   running   from  12:30  until   the  finish,   which  usually  was 
about 2:00. Because many new issues were raised in the second half,
questionnaires were given only to those who heard the entire presentation and,
consequently, were qualified to evaluate it. It is not clear what the opinions
of    the  missing   23%  would   have  been  had   they  remained  and   completed   the 
questionnaire.      Over   half   of   these  people   let   the  workshop  presenters  know 
that   other   obligations  were  going    to  prevent   their  attendance  at   the  after- 
noon  session,   and   expressed  a  positive  attitude  toward   the  concepts  presented 
to   that  point.      But   some  of   the  remaining   half  may  have  left   because  of   a 
negative  opinion  toward  what  was   being   presented. 

The results of the questionnaire show that the general reaction of
participants toward the concepts presented at the workshop was quite positive.
Most  people   (96%)    thought   the  Department   should  release  the   item  pool,    and 
most    (90%)    felt   they  would  use   it.      Over  four-fifths    (81%)    felt   they  would 
use  the  pool   to  replace   items,   while  over   two-fifths    (41%)    felt   they  would 
use  it   to  develop  a  new  test.      Interestingly,    there  were  22   respondents  who 
were  not  planning   to  use  the  State's  Basic   Skills   tests  during   1980-81,    and 
of    these  22,    95%  favored  release   of    the  pool  and   86%  felt   that   they  would 
use  it.      Thus,    the  concept  of    the   item  pool   seems   to   have  great   appeal   even 
for   those  who  have  preferred   to   look  beyond  what   the  Department   has  provided 
so far.

Most (85%) felt they could make effective use of DCVs and P-values. Only
8% of the responding participants objected to receiving DCVs and P-values.
P-values and R-values were equally well received overall in the "like to
receive" category, with each getting a positive rating from 54%. Thus,
P-values were rated as highly as any statistic in each category, while the other
two were viewed less favorably in at least one category.

Participants  felt   that    it  was  desirable  to   be  able  to   answer   each  of    the 
eight questions presented at the workshop. Even the lowest rated question
(being able to determine the percentage of students statewide who could pass
a local standard on a local test) was viewed as desirable by twice as many
people as viewed it as undesirable, while the highest rated question (being
able to compare the difficulty of one test to another) was viewed as desirable
by a ratio of 13 to 1.




The  two  most  basic  questions  (replacing  items  in  the  test  and  being 
able  to  compare  the  difficulty  of  one  item  to  another)  received  the  highest 
desirability ratings, receiving favorable ratings from 9 out of 10 respondents.
The next four (assessing relative strengths and weaknesses, assuring
that standards were equivalent from year to year, changing the length of the
test and assessing changes in strength and weakness from year to year) were
questions that would appeal to people who wanted to use test results; they
were judged as desirable questions by between 80 and 85 percent of the
respondents. The remaining two questions (making comparisons between local
students  and  students  statewide),  while  favorably  received,  were  judged  as 
less  desirable  than  the  others. 

Last,  but  certainly  not  least  to  the  RMC  staff,  was  the  overall  rating  of 
the  workshop.   Almost  9  out  of  10  respondents  judged  the  workshop  to  be 
"excellent"  or  "quite  good."  The  three  highest  rated  workshops  were  Springfield, 
Lakeville,  and  North  Reading,  where  in  each  case  the  workshop  received  an 
average rating of 4.45 on a scale of 1 to 5.




Table 2
Summary of Results from Page 1 of Questionnaire

Question                                  Response    Percent choosing response

1.  Using State's Basic Skills Tests?     Yes         86
                                          No           9

2.  Favor release of item pool?           Yes         96
                                          No           4

3.  Use pool?                             Yes         90
                                          No           8
    3a. To replace items?                             81
    3b. To develop new test?                          41

4.  Make effective use of:
      DCVs?                               Yes         85
                                          No           5
      P-values?                           Yes         85
                                          No           8
      R-values?                           Yes         64
                                          No          25

5.  Like to receive:
      DCV?                                            45
      P-value?                                        54
      R-value?                                        54

6.  Object to receiving:
      DCV?                                             8
      P-value?                                         8
      R-value?                                        22
      None of these?                                  66


Table 3
Summary of Results from Page 2 of Questionnaire

Question                                  Rating              Percent

Test Construction

I.    Replace Items in Test               Very Undesirable       7
                                          Undesirable            3
                                          Desirable             29
                                          Very Desirable        61

II.   Change Length of Test               Very Undesirable       8
                                          Undesirable            6
                                          Desirable             39
                                          Very Desirable        43

Test Interpretation

I.    Compare Difficulty                  Very Undesirable       5
                                          Undesirable            2
                                          Desirable             34
                                          Very Desirable

II.   Compare Average Score               Very Undesirable      13
      to Statewide Average                Undesirable           10
                                          Desirable             34
                                          Very Desirable        43

III.  What Percent Statewide              Very Undesirable
      Would Pass Your Standard            Undesirable
                                          Desirable
                                          Very Desirable

IV.   Make Standard Equivalent            Very Undesirable
                                          Undesirable
                                          Desirable
                                          Very Desirable

V.    Relative Strengths                  Very Undesirable       6
      and Weaknesses                      Undesirable            7
                                          Desirable             30
                                          Very Desirable        55

VI.   Changes in Strengths                Very Undesirable       6
      and Weaknesses                      Undesirable           11
                                          Desirable             39
                                          Very Desirable        41

VII.  Overall Rating of Workshop          Excellent             43
                                          Quite Good            45
                                          Fair                  10
                                          Inadequate             1
                                          Poor                   0




XI.   SCALING  THE  ITEM  POOL 

Two  related  issues  were  resolved  during  the  planning,  operation  and 
review of the workshops: (1) how the pool should be scaled (i.e., how the
Department should go about determining the difficulty of the items), and
(2)  the  types  of  restrictions  that  would  need  to  be  placed  on  tests  that 
were  developed  from  the  pool  in  order  to  insure  they  were  equivalent  to 
the  State  tests.   These  two  issues  are  reviewed  in  this  and  the  next 
sections  of  the  report. 

After  the  pool  has  been  developed,  it  will  be  necessary  to  determine 
what the relative difficulties of the items are ("scaling the pool"). It
will  be  necessary  to  do  this  in  order  to  insure  that  equivalent  forms  developed 
each  year  are,  in  fact,  equivalent.   It  also  will  be  desirable  to  do  this, 
for  it  is  only  after  such  scaling  has  been  done  that  it  will  be  possible  to 
make  the  kinds  of  comparisons  needed  to  interpret  test  results  on  a  relative 
basis.   There  are  two  types  of  samples  of  students  that  the  Department  could 
select: a representative sample or a sample of convenience. A representative
sample is one selected from all the schools in the state with the expectation
that the average score in the sample equals the average score statewide; a
sample of convenience is one selected from a subset of schools that are
cooperative and will present few administrative difficulties. Such a sample can
never  be  assumed  to  be  representative  of  the  state.  With  a  representative 
sample,  one  could  estimate  the  proportion  of  students  statewide  who  could 
answer  correctly  each  question  administered  to  the  sample.  With  a  sample 
of  convenience,  the  Department  could  determine  the  relative  difficulties  of 
items ("Item A is 5 percent harder than Item B"). But since it would not
know  the  relationship  between  the  sample  chosen  and  statewide  averages,  it  would 
not  be  able  to  determine  the  proportion  of  students  statewide  who  could  answer 
the  questions  correctly. 

When  samples  of  convenience  are  chosen,  it  is  sometimes  possible  to 
estimate  how  different  the  sample  is  from  the  population.   This  is  done  through 
a  process  called  linking.   If,  for  example,  we  were  to  administer  to  our  sample 
a  set  of  items  that  have  been  normed  statewide  we  could  estimate  how  different 
the  performance  of  the  sample  is  from  the  performance  of  the  whole  population. 
To  obtain  estimated  statewide  averages,  item  difficulties  could  be  adjusted  by 
the  amount  of  that  difference.   Similarly,  if  we  were  to  administer  a  set  of 
new  items  to  students  whose  relationship  to  the  population  is  known,  we  could 
estimate  how  different  those  new  items  were  from  the  old  items  by  observing  the 
performance  of  the  students  on  both  sets.   The  first  example  is  a  type  of  "item 
linking;"  the  second  is  "person  linking." 
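
The first kind of link can be sketched as follows, assuming the simplest possible adjustment: the average gap between the sample's P-values and the statewide P-values on the previously normed items is subtracted from every P-value observed in the sample. The function name, the additive adjustment on the proportion scale, and the figures in the usage note are assumptions for illustration; the report does not specify the computation.

```python
def link_to_statewide(sample_p, anchor_sample_p, anchor_state_p):
    """Estimate statewide P-values from a sample of convenience.

    sample_p: dict of item name -> P-value observed in the sample.
    anchor_sample_p / anchor_state_p: P-values for the same normed items,
        as observed in the sample and as known statewide (same order).
    """
    if not anchor_sample_p or len(anchor_sample_p) != len(anchor_state_p):
        raise ValueError("anchor lists must be non-empty and of equal length")
    # Average amount by which the sample outscores (or underscores) the state.
    gap = sum(s - t for s, t in zip(anchor_sample_p, anchor_state_p)) / len(anchor_sample_p)
    # Remove that gap from each new item, keeping results between 0 and 1.
    return {item: min(1.0, max(0.0, p - gap)) for item, p in sample_p.items()}
```

For example, if the sample scores .80 and .70 on two normed items whose statewide values are .75 and .65, the sample runs about .05 high, and a new item with a sample P-value of .60 would be estimated at about .55 statewide.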

Links  can  be  strong  or  weak.  When  we  collect  a  lot  of  data  on  a  wide 
variety  of  people  and  items,  the  link  is  a  strong  one,  and  it  is  possible  to 
estimate  very  precisely  what  the  performance  of  the  population  of  students 
would  be  on  the  universe  of  items.   More  often,  however,  we  are  constrained  by 
cost,  time  or  other  factors,  and  the  links  are  weaker.  When  the  link  is  weak, 
the  same  estimates  can  be  made,  but  there  is  greater  uncertainty  about  them; 
the  probability  is  greater  that  our  estimates  have  some  error  in  them. 




After  Massachusetts  develops  its  item  pool,  it  could  norm  all  the 
items  in  the  pool  on  a  representative  sample  of  students  in  the  ninth  and 
twelfth  grade  and  then  provide  those  results  to  districts  using  the  pool. 
However, the cost of conducting such a study, given current budget levels
for the Bureau of Research and Assessment, is prohibitive.
What the Bureau could do, and what RMC recommends that it do, is to
administer the items in the pool, along with some items for which twelfth grade
norms are available, to a sample of convenience of ninth and twelfth grade
students in 10 schools around Massachusetts. Linking data would then permit
estimation of the statewide performance of the population of students on the
universe of items.

The  linking  would  work  as  follows.   The  administration  of  items  from 
previous  assessments  to  twelfth  graders  would  permit  estimation  of  how 
different the sample of convenience of twelfth graders is from what a
representative sample would have been. This difference would then be added to or
subtracted from (depending on whether the sample of convenience is estimated to
be higher or lower scoring than the state as a whole) each of the item
difficulties obtained from the sample of convenience. Ninth grade estimations
would  be  made  by  comparing  the  performance  of  ninth  graders  to  twelfth  graders 
in  the  sample  of  convenience,  and  then  using  that  value  to  adjust  all  the 
ninth  grade  results.   Because  the  ninth  grade  estimations  rely  on  two  links, 
they  are  then  going  to  be  less  precise  than  the  twelfth  grade  results. 
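
The loss of precision from chaining two links can be illustrated with a short calculation. It assumes that the errors in the two links are independent, so that their variances add; the standard errors used here are hypothetical and are not estimates from any Massachusetts data.

```python
import math


def chained_link_se(*link_standard_errors):
    """Standard error of an adjustment built from several independent links."""
    return math.sqrt(sum(se ** 2 for se in link_standard_errors))


# Hypothetical standard errors, on the proportion-correct scale.
se_sample_to_state = 0.02    # twelfth grade sample vs. statewide norms
se_ninth_to_twelfth = 0.03   # ninth vs. twelfth graders within the sample

se_twelfth = chained_link_se(se_sample_to_state)
se_ninth = chained_link_se(se_sample_to_state, se_ninth_to_twelfth)

print(f"twelfth grade: {se_twelfth:.3f}, ninth grade: {se_ninth:.3f}")
```

Whatever values are used, the ninth grade figure can never be smaller than the twelfth grade figure, because it carries the error of the first link plus that of the second.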

It  was  noted  earlier  in  this  section  that  estimations  of  relative 
difficulties of items can be made quite precisely with small samples of
convenience, while estimations of the proportion of students statewide who could
correctly answer a question are made with far less precision. As a
consequence, conducting a sampling of convenience will have differential effects
on the ability to answer each of the eight questions posed at the workshops:
some can be answered with great precision from a sample of
convenience, while others cannot. The following table summarizes that
information:




TABLE IV

The precision with which each of the eight questions presented at the
workshops can be answered by results from a small sample of convenience.

                                            Precision for      Precision for
Questions                                   the Ninth Grade    the Twelfth Grade

Test Construction

I.    How can you replace items in          very high          very high
      the State test?

II.   How can you change the length         high               high
      of the state test?

Test Interpretation

I.    How can you compare the               high               high
      difficulty of one test to another?

II.   How can you compare the average       low                medium
      score of your students to that
      of students statewide?

III.  If all twelfth grade students         very low           low-medium
      in the state had taken your
      test, what percent would have
      passed your standard?

IV.   How can you make this year's          high               high
      standard equivalent to last year's
      standard if you have changed the
      test?

V.    How can you determine which           high               high
      content areas are your district's
      relative strengths and weaknesses?

VI.   How can you determine the content     medium-high        medium-high
      areas in which your district's
      scores have changed from last
      year?



Thus, taking a sample of convenience not only would bring the costs for
scaling well into line with what the Bureau could afford, but it would permit
the questions judged as more desirable to be answered with a high degree of
precision. The questions judged as less desirable (those relating local
averages to statewide averages) could be answered with far less precision. If
the demand for data to answer such questions is high enough in the future to
warrant the cost, the Bureau could scale the pool on a representative sample
at some later date.




XII.      CRITERIA  FOR  A  STATE   TEST 

When  the  pool   is  released,    each  district  using    it  will  be  permitted 
to   construct   from   the  pool  a   test   which  will   be  called   a   state   test   and 
will not require special approval from the Department, providing the
district develops its test according to certain criteria. These criteria will
be the same ones the Department uses to develop its two equivalent forms
each year. The criteria will be of two types: one set that controls the
content of the test, and another set that controls the statistical properties
of the test. The purpose of this section is to describe what the two
sets of criteria must be like.

Content  of   Test.      The  content  of    the  test   is  determined  by  choosing 
items  that  are  related   to   the  Basic   Skills  objectives.      The  objectives    (and 
items) are all intercorrelated, and to some extent they are interchangeable.
However, the tests should be constructed so that they have a balance of items,
i.e., they should measure all of the objectives and not have too many
measuring any one objective. There are 14 objectives in reading measuring 5
different content areas; there are 38 objectives in mathematics measuring 6 content
areas. A state test certainly should contain some items measuring each of the
content areas; it probably also should have items measuring each objective.
The current State tests measure every objective. In reading, each objective
is measured by between 2 and 6 items; in math, by between 1 and 6. Some
balance of this nature would be important to maintain. A workable criterion
would be to require that each objective in any State test must be measured by
at least two items in reading and one in math; that not more than 10 percent of
the test could consist of items measuring any one objective; and that not more
than 40 percent of the test could consist of items measuring any one content
area. Although the individual items will all have been screened for racial,
ethnic, and sexual bias prior to their inclusion in the item pool, it is
possible that the set of items selected by a district would be biased. For
instance, male characters might be overrepresented in the reading passages.
It is likely, therefore, that districts would be required to systematically
check their test for bias using a checklist provided by the Department.
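
The workable content criterion described above lends itself to a mechanical check. The sketch below is one possible way a district might verify a draft test; the function name, the representation of items as (objective, content area) pairs, and the wording of the messages are assumptions for illustration and are not part of any Department procedure.

```python
from collections import Counter


def check_content(items, objectives, min_items_per_objective):
    """Return a list of violations of the content criterion (empty if none).

    items: list of (objective, content_area) pairs, one per test item.
    objectives: every objective the test is expected to measure.
    min_items_per_objective: 2 for reading, 1 for math under the criterion.
    """
    total = len(items)
    per_objective = Counter(objective for objective, _ in items)
    per_area = Counter(area for _, area in items)
    problems = []
    for objective in objectives:
        if per_objective[objective] < min_items_per_objective:
            problems.append(f"objective {objective}: too few items")
    for objective, count in per_objective.items():
        if count * 10 > total:        # more than 10 percent of the test
            problems.append(f"objective {objective}: over 10 percent of test")
    for area, count in per_area.items():
        if count * 10 > total * 4:    # more than 40 percent of the test
            problems.append(f"content area {area}: over 40 percent of test")
    return problems
```

A reading test with two items on each of the 14 objectives would pass this check, provided no single content area accounted for more than 40 percent of the items.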

Statistical  properties.      The  only  statistical  property  pertinent   to   the 
basic  application  of   the  State  Basic   Skills   test    is  the  standard   error  of 
measurement  at   the  cut-off    score  established  by  the  district.      If    the  standard 
error    is   large,    there  will  be  a   lack  of  precision  of   the  scores  of    students  who 
are  near   the  borderline;    that   is,    there  would  be  a   substantial  probability  that 
those who failed would, on retest, pass, and conversely, that those who passed
would fail a second testing. If the standard error of measurement is small,
students scoring near the cut-off will be classified properly more often.
Within reason, the goal should be to have tests developed that have uniformly low
standard errors of measurement at the cut-off score.

The   standard   error  of  measurement    is  directly  related   to   the  difficulty  of 
the items on the test. If a test question is very easy or very hard for
students scoring near the cut-off, little information is gained from their answer,
and  consequently,    the  standard   error   of  measurement  for   their   score  remains 
large.      To  minimize  the  standard   error   of  measurement  at  a  particular  cut-off 
score,    a  test  would   be  designed    so   that  a  just  minimally  competent   student 
would have a 50-50 chance of getting every item right. In practice, such a
test would be difficult to construct; additionally, the test would be
considered a little too hard, and since the cut-off score would be only 50%,
there would be an appearance that districts were establishing a standard
that was too low.
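
The relationship between item difficulty and the standard error at the cut-off can be illustrated under the Rasch model, which is assumed here because the item pool is described in terms of Rasch scale values. Each item contributes information p(1 - p) at a given ability level, where p is the probability of a correct answer, and the standard error (in logits, on the ability scale) is the reciprocal of the square root of the total information. The test lengths and difficulties below are hypothetical.

```python
import math


def rasch_probability(ability, difficulty):
    """Probability of a correct answer under the Rasch model (logit scale)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def standard_error_at(ability, difficulties):
    """Standard error of measurement, in logits, at a given ability level."""
    information = 0.0
    for difficulty in difficulties:
        p = rasch_probability(ability, difficulty)
        information += p * (1.0 - p)   # largest when p is 0.5
    return 1.0 / math.sqrt(information)


cutoff = 0.0                       # ability of a just minimally competent student
matched_test = [0.0] * 40          # every item is a 50-50 proposition at the cut-off
easy_test = [-2.0] * 40            # every item is easy for students at the cut-off

se_matched = standard_error_at(cutoff, matched_test)
se_easy = standard_error_at(cutoff, easy_test)
print(f"matched: {se_matched:.2f} logits, easy: {se_easy:.2f} logits")
```

With the same number of items, the test whose items are matched to the cut-off yields the smaller standard error at that point, which is the reason for restricting the number of very easy and very hard items.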

One  way  to  control   the  standard   error   of  measurement  at   the  cut-off 
score would be to require people to choose a certain number of items of a
moderate  difficulty   level.      This  was   the  approach  taken  at   the  workshop. 
The  average  difficulty  was  required   to  be   in  a   specified  range,    and   the 
number   of   particularly  hard  and   easy   items  was  restricted  rather   severely. 
An  alternative  approach  is  more  flexible  but  a  bit  more  complicated.      This 
approach  would   enable  districts   to  use  a  large  number  of    items  with  a  wide 
range  of  difficulties   or  a   smaller  number  of    items  with  difficulties   closer 
to   the  cut-off    score.      There   is  no  need   to  prefer   one  approach  over   the 
other   at   this  time,    since  a  decision  about  which  one   is  better  would  not 
need   to  be  made  until   it   is   time  to   train  districts   in  the  use  of   the  pool. 


XIII.       RECOMMENDATIONS 

The   purpose   of    this    section   will   be   to   describe    the    system   of    pro- 
viding   Basic    Skills    tests    that   RMC   views   as  most    feasible   for   Massachusetts, 
to   justify   that   choice,    to   describe   how   it  might   be   implemented,    and    to 
evaluate    it   on   the  basis    of    the   24    criteria   RMC   listed    in    its   original   pro- 
posal  to    the  Department. 

The   system  RMC   recommends   begins  with   the  development   of    an   item  pool. 
Using    the   items   developed    thus   far   as   a   base,    the  Department    should    expand 
that   base   through   the   purchase  of   available    items   and   through  additional 
item  writing,   until   the  pool    is   of    sufficient    size — around    600   items    in  read- 
ing   and    600   items    in  mathematics.      In    subsequent   years   the  Department    should 
also   consider   developing    an    item   pool    in   the  area   of    listening. 

Once   the  pool   has   been  developed,    the  Department    should   develop    two 
equivalent   forms   of    a   test    each  year   by  choosing    a    stratified   random    sample 
of    items   from   the  pool.      Camera-ready   copies    of    these   forms    should   be   pro- 
vided   to    school  districts   that    indicate    in   their   plans   they  will   be  using 
the   State   test.      To    local  districts,    therefore,    the  appearance  of    the  rec- 
ommended   system,    as  described    to    this   point,    will   be  much   the   same  as    the 
current    system. 

However,    RMC   also   recommends   that    the    item   pool   be  reproduced   and   dis- 
tributed   to    school   districts   at   a  workshop.      Along   with   the    items,    the 
Department    should   also   provide   item    statistics   and   a    set   of    the  requirements 
a    school   district  must   follow   in   order    to  use   the   item  pool.      The    item 
statistics   provided    initially   should   be  P-values    (the  proportion  of    students 
who   answered    the    item   correctly) ,    obtained   from   a    sample  of    convenience  of 
ninth  and    twelfth  graders   around    the   state.      A  recommended    sampling    procedure 
is  discussed    in   Chapter   XI.      The  Department    should  monitor    the   acceptance   of 
these statistics and provide R-values (Rasch item difficulty values) when
feasible  at    some  future  date. 

RMC   recommends    that    the  writing    test    supplied   by   the  Bureau   be   expanded 
only   slightly  from  current    levels.      An  additional   twelve   items    should    be 
constructed — six    each  for    letters   and    essays.      Each  of    these   items   should   be 
field   tested   on  50   students   selected   from  those   in  the  10   schools    in  which 
the  mathematics  and   reading    items  will   be   scaled.      The  results   of   the  field 
testing    should    indicate  whether   the  directions  are  clear  and  whether   the 
tests   can  be  holistically   scored.      RMC  recommends    such  a   limited    expansion    in 
writing,    compared   to  reading   and  mathematics,    because  the  writing   items   can 
and    should   be  reused   from  year   to  year.      Comparisons  of   results  are  difficult 
when   the   topic   of    the  writing    item  changes,    and,    since    issues    surrounding 
teaching    to    the   test   differ   greatly   for   open-ended   and   for  multiple  choice 
questions,    it    is   appropriate   in   this   case   to  use  a  very   limited   number   of    test 
questions   when   constructing    the  writing    test. 

The   system   of    providing   districts   with  objectively   scored    tests  which 
RMC is proposing is best summarized in terms of three major recommendations:
(1) that the Department develop and scale an item pool; (2) that the Department
develop from the item pool two equivalent forms of a test each year;
and (3) that the Department release the item pool for use by local districts.
The remainder of this chapter discusses how each of these recommendations
should    be    implemented   and   how   the   system   as   a  whole  fares   against    twenty-four 
evaluation  criteria. 


RECOMMENDATION    #1 

Development   of   an   Item  Pool 

RMC  recommends   that   the  Department   use  the  following   procedures   to 
develop   its   item  pool: 

1.  Identify  available   item  pools,    and   their  probable  quality,    coverage 
and   cost.      This   activity   has   been  undertaken,    in   part,    by  RMC  during 
the  current  year's   contract.      It    should   be  continued   throughout  next 
fall. 

2.  Determine  which  pool(s)    should  be  purchased.      The  objective   in  this 
step   is   to  put   together   the  best   combination  of   pools    (or   parts  of 
pools)    that  will  cover   the  most   objectives  at    the  least   cost,    hold- 
ing  the  quality  of    the  items  at  a  high  level.      It    is  possible  that 
a  point  will  be  reached  where  a  combination  of   pools  covers  most, 
but  not  all,    of   the  objectives,    and   the  addition  of   another   pool 
which  would  cover   some  of    the  remaining   objectives  would   be  prohibi- 
tively  expensive.      In  that   case,   new  items  would   be  written  to  measure 
these  objectives  not   covered   by  the  combination  of   pools. 

For   each  pool,    the  following    information  should   be  obtained: 

a.  Which  Basic   Skills  objectives  are  measured   by   items   in 
the  pool,    and  how  many   items  are  available  for   each  objective? 
Since  each  pool   is  written  from  a  different    set  of   objectives, 
it  would   be  necessary  to  determine  the  correspondence  between 
the  objectives   each  pool  used   and   the  Basic   Skills   objectives. 

The  Department  would   call  on  the   services  of    two  Review  Committees, 
one  for  reading   and  one  for  mathematics.      A  sample  of   the   items 
from   each  objective  would  be  examined   by  the  Review  Committee  to 
make   it   likely  that  the  items,    if   purchased,   would    indeed  measure 
the  objectives  as  claimed,    to   the  Review  Committee's   satisfaction. 

b.  For each pool offered for a price, can parts of it be purchased,
or must it be purchased in toto? If a sufficient number
of   pools   can  be  purchased   piecemeal,    costs  might   be  substantially 
reduced,    since  items  from   one  pool  which  overlap   those  of  others 
could   be  left   out  of    the  purchase  agreement. 

c.  Have  the  items  been  field  tested?  If  the  items  in  any  pool 
have  not  been  field  tested,  RMC  recommends  that  the  pool  not  be 
considered  for  purchase,  since  the  quality  of  the  items  will  be 
too  uncertain. 

d.  What   is   the  offered   price  of   the  pool?      If   parts   of   the  pool 
can  be  purchased   separately,    what    is   the  price  of   each  part? 

e.  What  are  the  likely  costs   of   rejecting   or   revising    items   in 
the  pool?      The  review  conducted   by  departmental  review  committees, 
mentioned above, should have provided an estimate of the extent
to which items in the pool would be rejected outright, because
they  do  not  measure  the  objectives   properly,    or   provisionally 
accepted,    because  the  Review  Committee  perceived    technical   flaws 
in   the   items. 


f.  What period of ownership will Massachusetts be given, and
what   will   be   the  costs   of   replacing   those   items  when  ownership 
expires?     Massachusetts  will  be  looking   for  copyrights  to   the 
items  for  use  within  Massachusetts   only;    owners  of    the  pools 
will   still  retain  all  rights   to   the  pool  outside  of  Massachusetts. 
In  addition,    Massachusetts  will  be  considering   only  pools  which 
have  a  one-time  cost  associated  with  them;    no  arrangements  will 
be  entered    into  which  involve  periodic    (such  as  annual)    payments. 

3.  Negotiate purchase of item pool(s). It is not clear whether these
negotiations should be done by issuing a Request for Proposals or
by direct negotiations. It would depend on how certain Massachusetts
was that all available sources had been identified, and whether the
initial costs discussed were fair or inflated.

4.  Review  the   item  pools.      After   the  pools  have  been  purchased,    the 
Review  Committee  should  review  each  item,    and  place   it   in  one  of 
three  categories:      accept  as   is,    accept  with  revision,    reject.      Other 
appropriate  groups   such  as   the  Bureau   of   Equal  Educational  Opportunity 
should  also  review  all   items. 

5.  Revise   items  as  necessary.      Items   in  the  "accept  with  revision"  category 
should  be  rewritten  according   to   the  Review  Committee's   specifications 
and   then  reviewed   a   second   time.      Items  failing   the  second  review  should 
be  dropped. 

6.  Write  new  items  as  necessary.      New  items  would  be  written  to  cover 
two   types  of   objectives:      those  for  which  no   items  were  purchased, 
and   those  for  which  so  many   items  were  rejected  by  the  committee 
that  an   insufficient  number  measuring    the  objective  remain. 

7.  Screen  new  items.      The  new  items   should   be  subjected   to  the  same 
screening   process  as   the  other   items  were. 

8.  Field   test  all  new  items.      This  field   testing   should  be  done  on 
100  students  per   item   on  a  relatively  representative  sample  of 
convenience.      It   should   include  interviewing   students  as  well  as 
reviewing   test   results   in  order   to   insure  the  match  of   the   item  to 
the  objective  being  measured. 

9.  Revise  new  items  as  necessary. 

10.      Conduct  final   screening.      The  Review  Committee  should  meet   to  review 
all  revised   and  new  items.      The  pool,    at   this  point,    should  be  in 
nearly final form and ready for pilot testing (Steps 11-17).
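
Step 2 above, choosing a combination of pools that covers the most objectives at the least cost, can be sketched as a greedy selection. This is only one possible heuristic and is not guaranteed to find the cheapest combination. The pool names, prices, objective codes, and the cost ceiling per objective are all hypothetical.

```python
def choose_pools(pools, objectives, max_cost_per_objective):
    """Greedy sketch. pools maps a pool name to (cost, set of objectives it covers).

    Returns the pools chosen, in order, and the objectives left uncovered,
    for which new items would have to be written (Step 6).
    """
    uncovered = set(objectives)
    chosen = []
    while uncovered:
        best_name, best_rate = None, None
        for name, (cost, covers) in pools.items():
            gain = len(covers & uncovered)
            if name in chosen or gain == 0:
                continue
            rate = cost / gain  # cost per newly covered objective
            if best_rate is None or rate < best_rate:
                best_name, best_rate = name, rate
        # Stop when no pool helps, or when the best remaining pool is prohibitively expensive.
        if best_name is None or best_rate > max_cost_per_objective:
            break
        chosen.append(best_name)
        uncovered -= pools[best_name][1]
    return chosen, uncovered

objectives = {"O1", "O2", "O3", "O4", "O5", "O6"}
pools = {
    "A": (3000, {"O1", "O2", "O3", "O4"}),
    "B": (2000, {"O3", "O4", "O5"}),
    "C": (5000, {"O6"}),
}
chosen, uncovered = choose_pools(pools, objectives, max_cost_per_objective=2500)
```

In this made-up case pools B and A are selected, and objective O6 is left to be covered by newly written items because the only pool that measures it costs more per objective than the ceiling allows.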

After   the  pool  has  been  developed,   RMC  recommends   that   it   be  administered 
to   students   in  Massachusetts   in  order   to  make  a  determination  of    the  relative 
difficulties of the items and to obtain information as to the correlation
between each item and other similar items. The latter type of information may
lead   to   the  rejection  of    some   items  from  the  pool,    but   it   is  likely  that   this 
would   be  a  highly   infrequent   occurrence,   given  the  screening   and   field  testing 
process  the  pool  has  passed. 

At   this  time,    it   is  anticipated   that   the   item  pool  developed   in  Steps 
1-10 would contain 600 items each in reading and mathematics, including the
over   200   items  already  written.      These  items   should   be  divided    into   twenty 
forms  and   each  form  should   be  administered   to  300   students   in  grades   9  and   12. 
Thus,    the  total   sample  size  should   be  6,000   students  per  grade. 
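
The arithmetic behind this design can be checked with a brief sketch. It assumes roughly 400 new items per subject (600 less the 200 or so already written), split at random into 10 forms per subject. The item identifiers and the function name are hypothetical.

```python
import random

def split_into_forms(item_ids, n_forms, seed=0):
    """Shuffle the items and deal them into n_forms forms of (nearly) equal size."""
    rng = random.Random(seed)
    shuffled = list(item_ids)
    rng.shuffle(shuffled)
    return [shuffled[i::n_forms] for i in range(n_forms)]

new_reading = [f"R{i:03d}" for i in range(400)]   # hypothetical item identifiers
new_math = [f"M{i:03d}" for i in range(400)]

forms = split_into_forms(new_reading, 10) + split_into_forms(new_math, 10)
students_per_form = 300
students_per_grade = len(forms) * students_per_form   # 20 forms x 300 students
```

With these figures each of the twenty forms carries 40 new items, and the sample required at each grade is 6,000 students.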


The  pilot  testing  process  should  include  the  following  steps: 

11.  Construct booklets. The 800 new items (400 items each in reading and
mathematics) should be divided into 20 sets of 40 each, and the
20 items each in math and reading to be used for linking would be
selected. Ten booklets each in mathematics and reading should be
constructed.

12.  Write  manual.   A  short  manual  for  the  administration  of  the  pilot 
tests  should  be  written. 

13.  Print booklets. The twenty booklets should be typeset and printed
so that they are similar to the current Basic Skills tests. In
each booklet, 40 new items should be printed, along with 20 linking
items from previous Basic Skills tests. The 20 mathematics linking items
should be common to all 10 mathematics booklets, and the 20 reading
linking items to all 10 reading booklets.

14.  Contact schools. The Department should select a sample of 10 convenient
schools and contact them for their involvement in the pilot
testing. The schools chosen should include a reasonable balance
of size and ethnic composition.

15.  Administer  the  tests.   The  10  schools  should  be  visited  and  the 
tests  administered. 

16.  Score  the  tests. 

17.  Analyze  the  test  results.   The  test  results  should  be  analyzed 
using  Rasch  scaling  procedures.   Using  the  20  linking  items,  the 
relative  difficulty  levels  of  all  400  new  items  should  be  established 
for  both  9th  and  12th  grade.   The  difference  between  the  average 
achievement  levels  of  12th  graders  in  the  sample  and  12th  graders 
tested statewide last year should be determined, and the Rasch difficulty
levels for all items should be adjusted by that amount. These
estimated statewide Rasch difficulty values then should be translated
into estimated statewide 12th grade P-values.

Similarly,  the  difference  between  the  average  9th  grade  achievement 
level  and  the  12th  grade  achievement  level  should  be  computed,  and 
used  to  estimate  statewide  Rasch  difficulty  values.   Whether  these 
differences  should  be  computed  separately  for  different  objectives 
should  be  determined,  and  if  appropriate,  different  adjustment  values 
should  be  determined.   The  estimated  statewide  Rasch  difficulty  values 
for  the  items  then  should  be  translated  into  estimated  statewide  9th 
grade  P-values. 
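
The adjustment and translation described in Step 17 can be sketched as follows. The sketch assumes that pilot difficulties are expressed in logits relative to the pilot sample's average ability, that statewide abilities are roughly normally distributed with a spread of one logit, and that the difference between the statewide and sample averages has already been estimated from the linking items. The item names, the difference of -0.3 logits, and the function name are hypothetical.

```python
import math
from statistics import NormalDist

def expected_p_value(difficulty, mean_ability=0.0, sd_ability=1.0, n_points=99):
    """Approximate proportion correct on a Rasch item for a normally distributed group.

    The group is represented by n_points equally probable ability values.
    """
    group = NormalDist(mean_ability, sd_ability)
    abilities = [group.inv_cdf((k + 0.5) / n_points) for k in range(n_points)]
    probs = [1.0 / (1.0 + math.exp(-(theta - difficulty))) for theta in abilities]
    return sum(probs) / n_points

# Difficulties from the pilot, relative to the pilot sample's average ability (assumed).
pilot_difficulties = {"item_A": -1.0, "item_B": 0.0, "item_C": 1.0}

# Statewide average minus sample average, in logits (hypothetical value).
statewide_minus_sample = -0.3

# Re-express each difficulty relative to the statewide average, then translate to a P-value.
adjusted = {name: b - statewide_minus_sample for name, b in pilot_difficulties.items()}
p_values = {name: expected_p_value(b) for name, b in adjusted.items()}
```

The same function could be reused for the 9th grade by substituting the estimated difference between the 9th grade and 12th grade averages, computed separately by objective if that proves appropriate.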

RMC  has  recommended  that  the  Department  develop  an  item  pool  because  this 
approach  will  cost  no  more,  and  quite  probably  will  cost  far  less,  than  the 
current  practice  of  developing  new  forms  from  scratch  as  they  are  needed.   Even 
if  public  distribution  of  the  pool  were  not  recommended,  it  still  would  be  cost 
effective for the Department to build a pool from which to draw its tests. To
date, the Department has developed over 200 items each in reading and
mathematics,    and    under    the    system    it    has   been   employing,    would    be  develop- 
ing   another    500   or    so    in    each  area   over    the   next    five   years.      Developing    each 
of    the   tests   from    scratch   each  year   would    be   a   costly   process.      There    is   no 
question   that   developing   a   pool   of    500   to    1,000   items   at    the  beginning    of    the 
next   five  year   cycle,    and    then  drawing    tests   from    that   pool    each  year,    would 
be  no  more   expensive   than   the   current   procedure  and  most    likely  would   be   far 
less   costly. 

RMC has recommended that the pool be pilot tested on a sample of convenience
for three reasons:

1.  The   cost   of    such   sampling   will   be  affordable.      Sampling    on  a    large 
fully  representative  group   of    students  would   be   too    expensive  an 
undertaking   for    the  Department   at    this   point,    even    if    the    sampling 
were  at   just    one  grade.      By  using    the  procedures   recommended,    the 
Department   will   obtain   estimates   of    item  difficulty  with  usable 
precision   at   two  grades   at   reasonable  cost. 

2.  The  data   collected   from   such  a    sample  will    enable  districts   to 
answer   all   of    the  most    important    questions   they  might   have  about 
their   results. 

3.  The data collected from such a sample will not be accurate enough
to make valid comparisons of district averages to statewide results.

RMC   recommends   that   P-values   estimated   from   the  above   sample  be   supplied 
to    schools   along   with   the  pool   of    items,    provided    that   the  Department    emphasizes 
that these P-values are estimates. While these P-values probably will accurately
reflect the relative differences in the difficulties of items, they cannot
be    interpreted   as   P-values  usually  are.      That    is,    if    item  A  has   an   estimated 
P-value  of   85  and    item  B  has  an  estimated   P-value  of   80,    it    is  reasonable  to 
assume  that   if    those   two   items  were  administered   to   students   statewide,    5 
percent  more  would   get    item  A  right    than   item   B,    but   not   necessarily   that    the 
percentages  would   be  85   and    80  respectively.      The  recommendation   that   P-values 
be   provided   with  the  pool    is  made  because  P-values  were   the   highest   rated 
statistic   on  all    three   questions   posed   at    the  workshop.      With   the   learning 
that  must   take  place  at   the  beginning   of    the  operation  of   a    system   such  as   the 
one  proposed    in  this   section,    it  will  be  helpful   to  keep   the   statistics   as 
simple  and   familiar   as   possible,    at    least    in   the  beginning.      However,    because 
of    the   superior    statistical   properties   of    R-values,    RMC   recommends   that    the 
Bureau  monitor   the  use  of   the  pool,    and  when  use  of    the  pool  becomes  familiar 
and   comfortable,    it   should    introduce  R-values  as  the   statistic   to   eventually 
replace  P-values. 

RECOMMENDATION   #2 

Development   of   Two   Equivalent   Forms   Each  Year 

RMC   recommends    that    the  Department   develop   two   equivalent   forms   of    the 
test    each  year   and    provide  camera-ready  copies   of    those  forms   to   districts 
which   indicate   in   their   program  plan   that    they  will  be  using    the   State   test. 
RMC   recommends   that    the  Department   provide  camera-ready  copies   of    the   equiva- 
lent  forms   because  many   school  districts  need   and   have  come   to    expect    such 
materials   to   be   provided   at   the   secondary   level.      Additionally,    RMC  recommends 
the   development   of    two    equivalent   forms   because  feedback  from   school  districts 
indicated    that  many  need  more   than   one  form,    but    there   is  no   real  need    for 
more   than   two. 


RMC recommends that the Bureau construct a list of the requirements that
any    test,    including    its    own    equivalent    forms,    must    follow  before    it    can   be 
considered   a    State   test.      These   requirements   must    include   constraints   on   the 
length   of    the    test,    the   proportion    of    items    that   can   be    selected    from   any    one 
objective,    and    the    standard    error    of   measurement    that   will   be   permitted    around 
any   reasonable  range   of    cut-off    criterion    scores.      This    last    requirement    can 
be   achieved  most    easily   by   restricting    the  mean   and   variance   of    the  difficulties 
of    the   items   selected   for    the   test,    but   as   noted    in   Section  XII    of    this   report, 
the  Department  may   offer    alternative  restrictions   which  would    provide  districts 
with  more  flexibility.      The  recommendation    is  made   to    insure   that   all    tests 
developed  under    these  guidelines  will   be  roughly    equivalent. 
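
A check of a proposed test against such requirements might look like the following sketch. The specific limits (test length, maximum share of items from one objective, and the permitted mean and variance of item difficulties) are hypothetical, since the guidelines themselves have yet to be written. Difficulties are expressed here as P-values.

```python
from collections import Counter
from statistics import mean, pvariance

def guideline_problems(items, guidelines):
    """Return the list of guideline rules a proposed test violates.

    items is a list of (objective, difficulty) pairs; an empty list means the test passes.
    """
    n = len(items)
    if n == 0:
        return ["length"]
    problems = []
    if not guidelines["min_length"] <= n <= guidelines["max_length"]:
        problems.append("length")
    counts = Counter(objective for objective, _ in items)
    if max(counts.values()) / n > guidelines["max_share_per_objective"]:
        problems.append("objective share")
    difficulties = [d for _, d in items]
    low, high = guidelines["mean_difficulty_range"]
    if not low <= mean(difficulties) <= high:
        problems.append("mean difficulty")
    if pvariance(difficulties) > guidelines["max_difficulty_variance"]:
        problems.append("difficulty variance")
    return problems

# Hypothetical limits, for illustration only.
guidelines = {
    "min_length": 4,
    "max_length": 10,
    "max_share_per_objective": 0.5,
    "mean_difficulty_range": (0.60, 0.80),
    "max_difficulty_variance": 0.02,
}

balanced = [("R1", 0.60), ("R1", 0.65), ("R2", 0.70),
            ("R2", 0.75), ("R3", 0.55), ("R3", 0.80)]
lopsided = [("R1", 0.95)] * 6   # one objective only, and every item very easy

ok_result = guideline_problems(balanced, guidelines)
bad_result = guideline_problems(lopsided, guidelines)
```

The balanced test passes every rule, while the lopsided test is flagged both for drawing all of its items from one objective and for an average difficulty outside the permitted range.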

Once  guidelines   have  been  developed,    equivalent   forms    should   be   assembled 
each  year   according    to   the  following    steps. 

1.  The  required   number   of    items   for    each  objective    should   first    be   ran- 
domly  selected   from   the  pool    in   such  a   way   that   a  given    item   would 
appear    in  only   one   of    the   two  forms. 

2.  The   total    set   of    items    should    then  be   checked    to    insure   that    the 
average  R-value  and   the  distribution  of   R-values   conform   to   the 
requirements of the State guidelines.

3.  If    the   items   selected   at   random  do  not   create  a   test  which  meets   the 
guidelines,    items  would   be   selectively  replaced  until   the  guidelines 
are  met.      Care  would   be  taken   to  replace  as   few  items  as  possible. 
No  review  of    item  content    should   be  necessary  because   items  were 
screened   before  entering    the  pool  and   because  districts   have  the 
option  of   replacing    items.      However,    the   total   test    should   be  reviewed 
for  bias. 

4.  Camera-ready  copies   of    the  two  equivalent   forms  would   be  printed   and 
distributed   to  participating  districts. 
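
Steps 1 through 3 above can be sketched as a stratified random draw followed by a minimal repair. The sketch checks only the average difficulty, whereas the guidelines would also constrain the distribution. The pool contents, the difficulty scale, the permitted range, and the function names are hypothetical. Each item is represented as (item id, objective, difficulty).

```python
import random
from statistics import mean

def draw_two_forms(pool, items_per_objective, seed=None):
    """Step 1: draw the required number of items per objective for two disjoint forms."""
    rng = random.Random(seed)
    form_a, form_b = [], []
    for objective, k in items_per_objective.items():
        candidates = [item for item in pool if item[1] == objective]
        drawn = rng.sample(candidates, 2 * k)   # without replacement, so no item is shared
        form_a.extend(drawn[:k])
        form_b.extend(drawn[k:])
    return form_a, form_b

def repair_form(form, pool, used_ids, low, high):
    """Steps 2 and 3: swap one item at a time until the mean difficulty is in [low, high]."""
    form = list(form)
    used = set(used_ids)
    target = (low + high) / 2
    swaps = 0
    while True:
        current = mean(d for _, _, d in form)
        if low <= current <= high:
            return form, swaps
        # Remove the item pulling the mean furthest in the wrong direction.
        if current > high:
            worst = max(form, key=lambda item: item[2])
        else:
            worst = min(form, key=lambda item: item[2])
        spares = [item for item in pool if item[1] == worst[1] and item[0] not in used]
        if not spares:
            raise ValueError("no unused item left for objective " + worst[1])
        best = min(spares, key=lambda item: abs(item[2] - target))
        form[form.index(worst)] = best   # same objective, so the stratification is kept
        used.add(best[0])
        swaps += 1

# Hypothetical pool: 3 objectives with 20 items each, difficulties from -2.0 to 1.8.
pool = [(f"{obj}-{i}", obj, round(-2.0 + 0.2 * i, 1))
        for obj in ("M1", "M2", "M3") for i in range(20)]
form_a, form_b = draw_two_forms(pool, {"M1": 4, "M2": 4, "M3": 4}, seed=1)
used = {item[0] for item in form_a + form_b}
fixed_a, swaps = repair_form(form_a, pool, used, low=-0.5, high=0.3)
```

Replacing only the most extreme item at each pass is one way of honoring the instruction to replace as few items as possible. Other replacement rules would serve equally well.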

RECOMMENDATION   #3 

Public  Release  of    the  Item  Pool 

RMC  recommends   that   the  entire  pool  of    items   be  provided    in  camera-ready 
copy   to  all   school  districts  who    indicate   in  their   program  plan   that    they  will 
be  using   the  State  test.      This  recommendation   is  made  for   several  reasons: 

1.  The overwhelming majority of respondents to the workshop questionnaire
favored release of the pool and indicated that they could
and   would  make   effective  use   of    it. 

2.  Releasing    the  pool  would    result    in  more  districts   perceiving   value 
in  the  materials   being   provided   by   the  Department. 

3.  Releasing   the  pool  would  make  the   test   development   process  more 
open  and   relieve  the  Department   of    the   burden   of    proving    that    every 
item    on   every   test    it   produces    is   fair   for   all    students. 


4.      While   there   would    be    some   extra   costs   associated   with   the  public 
release   of    the   pool,    these   costs   would    be   counterbalanced    by    sub- 
stantial   savings    in    the  development   of    the    item   pool.       If    the 
Department   were  not    to   release   the   pool,    it   would    have   to    insure 
that   all   the   items  used    in    its    published    tests   were  perceived    as 
being    fair   for   all  groups   of    students   across   the   state.      With   the 
release   of    the  pool,    however,    such  costly   precision    is  unnecessary. 
Districts   would   have   the   option   of   replacing    any    item   on   the   test 
that    they  judge   to   be  unsatisfactory  with  another   preferred    item 
from    the   pool.      Consequently,    the   costs    of   constructing    and   field 
testing    the   pool   will   be  far    less    if    all   districts   have   access   to    it. 

RMC   recommends   that    the  Bureau    sponsor   workshops   for   district    staff   when 
the   item  pool   becomes   available,    and    that    the  pool   be  provided    only   to   those 
districts    that    send   representatives    to    the  workshop.      While  the  use  of    the 
item   pool  will   not   be   that    complex,    it   will   be  necessary  for   district    staff 
to   have  training    in   its  use  before   they   receive    it.      Without    such   training, 
misunderstandings   and   misuses   are  certain   to   occur   at   a   higher    rate. 

The following steps are recommended for release of the item pool:

1.  Items    in   the  pool    should   be   arranged   by   objectives   and   degree  of 
difficulty. 

2.  Items  should   be  printed    in  a  form  which  permits  a  district   to    insert 
them    into    equivalent   forms   released    by   the  Department,    e.g., 
loose   leaf   notebook,    perforated   pages,    index   cards,    or   gummed 
labels.      Item  difficulties  should   be  provided    in   separate  tables. 

3.  A  workshop  should   be  developed  and   presented   to   train  districts    in   the 
use  of    the  pool.      This  workshop  should    include  a  discussion  of    the 
state  guidelines,    the  physical  and    statistical  processes   involved 
in  adding   or   replacing    items,    and  methods  of    interpreting   results. 
Hands-on  experience  in  using   the  pool  would   be  an  essential   ele- 
ment  of    this  workshop. 

4.  Copies   of    the   item  pool  should  be  distributed   only  to   districts  attend- 
ing  the  workshop. 

Evaluation   of    the  Recommended   System 

In RMC's proposal to the Department, there were twenty-four issues presented
for use in evaluating alternative test delivery systems. Following is
a  discussion  of   how  the  recommended    system  addresses   each  of   those  issues. 

1.      Degree   of    local  district    involvement.      The  proposed    system   leaves 
the   issue  of    local  district    involvement  up   to   each  district.      Since  the 
Department  will   be  providing    two   camera-ready  copies   of    equivalent   forms   to 
districts each year, districts that desire no more involvement than they
currently have are accommodated. However, districts that want to develop a
testing   program   beyond    the  minimum  would   have  materials   and    information 
available  that   would    enable   them   to   do    so. 


2.  Degree   of    local   district    choice.      The   proposed    system   provides   for 
a   great   deal    of    local   choice.      Districts   can   choose    to   do   no   more   than 
administer    the   equivalent    forms    as    they   come   from    the   Department,    or    they 
can  decide   to    exercise  any   of    a    large  number    of    options. 

3.  Local   district    preferences.      To    the    extent    that    RMC   has    had    an 
opportunity   to    identify    local   district   preferences,    this    system  meets    those 
preferences.      As    shown   earlier    in   the   report,    the  proposed    system   contains 
all    the   essential   and   desirable  features    that   an   optimal    system    should    have. 
In  addition,    the   system    is   responsive   to    the  opinions    expressed   by    the  work- 
shop  participants. 

4.  Amenability to longitudinal research. If the Department maintained
a portion of the item pool securely, the proposed system would accommodate most
longitudinal  research   questions.      However,    if    all    the   pool   were  publicly 
released,    this    system  would   produce  misleading    longitudinal   results.      Changes 
in    item  P-values  would   not   necessarily    imply   an   overall   change    in   general 
achievement    levels,    since  with   the  pool   being   public,    instruction  might    be 
more   focused    on  just   those   items   and   not    on   others.      In   addition,    RMC   has 
recommended    that   the  Department   not   collect   data   on   a   representative   sample 
of    students   around    the   state,    because   of    budgetary   constraints   and    other 
reasons. Without such data, identifying longitudinal trends would be
difficult.

5.  Difficulty of obtaining normative data. As mentioned above, obtaining
accurate normative data is an activity that would cost more money than
the Bureau has available to spend at this time. The proposed system would
involve slightly higher costs than other systems for obtaining accurate normative
data, but cruder estimates, obtained through a sample of convenience,
can be obtained for reasonable cost. Collecting precise normative data would
be   expensive  for   any   system   that  might   be  adopted. 

6.  Degree  of   potential  use.      The  potential  use  of    the  proposed    system 
seems   to   be   extremely  high.      The   system  probably  would    be  used    by   all   who   are 
currently  using    the  State   tests.      In   addition,    the  workshop   results    indicate 
that  many  districts   that   are  not   now  using    the   State   tests   would  make  use   of 
the  proposed    system. 

7.  Impact   of    program   on  curriculum   and    classroom  practices.      The   impact 
that   the  proposed    system  will   have  on  curriculum  and   classroom  practices  will 
vary  from  district   to  district.      In   those  districts  where  there   is   a  desire   to 
have  a   testing   program  that   affects  curriculum  and   classroom  practices,    the 
materials  will  be  available  to   easily  develop    such  a   system. 

8.      Training   required.      Local   district   personnel  would    have   to    be   trained 
in   the  use  of    an   item   pool.      It    is   for    this   reason   RMC   has   recommended    that 
only  districts   which   send   representatives    to    training    workshops   be  given   copies 
of    the   items.      It    is   clear   from   our    experience    in   presenting    the   three   regional 
workshops in March, however, that a one-day training workshop will provide
adequate preparation for use of the pool.

9.  Availability of training materials. Other than the materials the
Bureau    has   developed    for    its    current    system   and    RMC   has   developed    for    the 
workshops,    not   much    exists    that    would    help   districts    to   use   this    system 
effectively.       Such  materials   would    have   to    be   developed    as    the  demand   re- 
quired   and    Bureau   resources   permitted. 

10.  Degree   to   which   security   can   be  maintained.      The   proposed    system 
takes   no  unusual    steps    to  maintain    security.      Copies   of    the    item   pool  will 
not   be   reproduced    indiscr imanately;    one  copy   per  district   will   be  provided 
only   to    those  who   attend    the  workshops.      More    importantly,    however,    the 
system   addresses    the    security    issue   by   supplying    so  many    test    questions 
that memorizing the answers without understanding how they are derived would
be  a  monumental    task. 

11.  Adaptability   of    system    to    elementary   program.      The  proposed    system 
potentially  could    have  great   adaptability   to   the   elementary   program.      If    an 
item   bank   is   purchased    for  use,    as    seems   probable,    it    likely  would   have   items 
at   a  variety   of   grades.      Thus,    it    is   conceivable   that  much  of    the   elementary 
pool   could    be   obtained    at    the   same   time   that    the   secondary    items   are   pur- 
chased. 

12.  Availability of test items.  The availability of test items from
which   to   develop   a   pool   remains  unclear,    and    the  uncertainty  will  not   be 
removed   until  Massachusetts    seriously   enters    into   negotiations    to   purchase 
a   bank.      However,    it    has   been   pointed   out    earlier    in   this   report    that    the 
development   of    items   for    this    system   cannot   cost  more,    and   probably  will   cost 
less,    than   the  current    system. 

13.     Need   for    quality   of   available   items.      While  the   pool   must   contain 
items of high quality to be acceptable, the burden to ensure that every
item meets every criterion of fairness is removed from the Department's
shoulders    in   the  current    system.      With   the  availability   of    an   item   pool, 
and    the  ease  with  which  the   equivalent  forms  of    the  State  test   can  be  modi- 
fied  by  replacing    individual    items,    there  no    longer    should   be   complaints 
about    individual    items   on   the   tests  directed   to   the  Department. 

14.  Convenience  of  scoring.  The  proposed  system  offers  no  advantages 
over  the  current  system  in  terms  of  scoring.  Each  district  would  still  be 
responsible  for   having    its   own  tests   scored. 

15.  Ease   of   assessing    quality   and    equivalence  of   forms.      With  the   list 
of   requirements   for   the  development   of    equivalent  forms   that   the  Department 
will  be   issuing,    all   tests  constructed  will  have   equal   quality  and  will  be 
equivalent.

16.  Difficulties    in   obtaining    copyright   ownership.      This    issue  also 
relates    to   the  construction   of    the  pool  which,    again,    is   an  uncertain  process 
at    this   time.      However,    this    issue    is   addressed    in   the  discussion  on   the 
development   of    the  pool,    since  by   buying    the   items,    the  Department  would   be 
purchasing    the  copyright,    at    least   for  use  within  Massachusetts.      Again, 
should    there  be  nothing   of   value   to   buy,    the  Department   would   develop    its 
own    items. 




17.  Adaptability to computerization.  While the proposed system could
be computerized, it is not recommended that this be done, for the reasons
detailed in the report: computerization would add a needless layer of cost and
complexity to a testing program that is, by its nature, fairly universal and
straightforward.

18.  Adaptability to change in direction by the Board of Education.  The
proposed  system  is  as  flexible  as  any  system  could  be.   Changes  in  objectives 
can  be  accommodated  by  adding  items  to  the  pool  and  changes  in  grades  can 
be  accommodated  by  extending  the  system  to  lower  grades. 

19.  Ability of system to mesh with unique local district basic skills
requirements.  The proposed system does not accommodate unique local
objectives.  However, this does not appear to be a problem since RMC found the
Basic  Skills  objectives  to  be  widely  accepted.   The  proposed  system  does 
accommodate  unique  local  approaches  to  assessment  very  well. 

20.  Credibility  to  public.   The  proposed  system  has  great  credibility 
with  local  district  staff.   It  is  believed  that  this  system  will  be  equally 
well  received  by  the  general  public.   It  is  easy  to  understand,  controlled 
by  the  State,  and  yet  permits  substantial  local  autonomy. 

21.  Ability of MDE to monitor local district implementation.  The
proposed system offers MDE very little monitoring capability.  Except for
obtaining  district  plans  and  written  assurances  that  State  guidelines  for 
test  construction  are  being  observed,  the  Department  would  not  be  monitoring 
local district implementation.  The Department, however, could require
districts to submit their tests for approval if it decided to monitor
implementation more closely.

22.  Ability  to  compare  across  districts.   The  proposed  system  provides 
a  level  of  comparability  that  is  almost  equal  to  the  current  system.   Under 
the  current  system,  two  districts  that  both  administer  the  State  test  can 
compare  their  scores  directly.   Under  the  proposed  system,  two  districts 
developing tests from the pool could compare their results, but such
comparisons would not be direct.  The two tests would have to be equated using
item P-values or R-values before such comparisons could be made.
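
The report does not specify the equating procedure. As a rough sketch only,
assuming each pool item carries a statewide P-value (the proportion of
students statewide answering it correctly), a district's mean score could be
expressed relative to the statewide mean expected on its own form. The
function names below are hypothetical and are not part of the proposed system.

```python
def expected_state_mean(p_values):
    """Statewide mean score expected on a form: the sum of its item P-values."""
    return sum(p_values)


def relative_standing(district_mean, p_values):
    """District mean minus the expected statewide mean, per item.

    A positive value means the district scored above what statewide
    P-values predict for the particular items on its form.
    """
    return (district_mean - expected_state_mean(p_values)) / len(p_values)


# Two districts with different forms drawn from the same pool.
form_1 = [0.80] * 50          # easier form (assumed P-values)
form_2 = [0.70] * 50          # harder form (assumed P-values)
standing_1 = relative_standing(42.0, form_1)
standing_2 = relative_standing(38.0, form_2)
```

Here the second district has the lower raw mean but the higher standing once
the difficulty of its form is taken into account, which is why raw scores
from different forms cannot be compared directly.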

23.  Ability of system to be revised if tests become a high school
graduation  requirement.   If  successful  completion  of  the  tests  administered 
to  fulfill  the  requirements  of  the  Basic  Skills  Improvement  Policy  should 
become  a  high  school  graduation  requirement,  the  proposed  system  might  have 
to  be  revised.   The  first  concern  would  be  the  security  of  the  system.   The 
experience  the  Department  gains  the  first  few  years  of  implementing  this 
system  should  clarify  whether  the  assumption  made  about  security  in  planning 
this  system  works — that  is,  is  it  true  that  by  providing  a  large  number  of 
test  questions  the  Department  has  frustrated   attempts  to  memorize  the  test? 
If so, the system can continue unchanged.  If not, the system would have to
be revised, either by supplying fresh items for the pool each year, by
tightening access to the pool, or by recalling the pool and maintaining it
securely within the Department.




24.  Compatibility with evaluation criteria in "Evaluating Basic Skills
Achievement."  The proposed system is compatible with most of the evaluation
criteria  in  Implementation  Guide  #1 — Evaluating  Basic  Skills  Achievement,  a 
document  published  by  the  Department  in  April,  1979.   This  document  prescribes 
certain  desirable  information,  characteristics  and  properties  tests  should 
have.  While  many  of  those  properties  relating  to  total  test  scores  are 
irrelevant,  the  ones  pertaining  to  item  development  must  be  kept  in  mind  and 
clearly  addressed  as  the  pool  is  developed.   If  that  is  done,  the  Department 
will  have  no  difficulty  in  demonstrating  the  superiority  of  its  own  materials 
and  system  alternatives. 


APPENDIX  A 


Initially  Proposed  Systems 

In  its  proposal  of  July,  1979,  to  the  Massachusetts 
Department  of  Education,  RMC  Research  Corporation  suggested 
10  systems  which  might  be  used  to  fulfill  the  Department's 
obligation  to  provide  an  evaluation  instrument  to  local  school 
districts  under  the  Basic  Skills  Improvement  Policy.   This 
appendix  was  written  shortly  after  the  start  of  the  contract 
in  September  to  expand  and  specify  in  more  detail  what  those 
systems  were  and  how  they  would  operate.   It  was  review  of 
this  rather  technical  paper  that  led  to  the  writing  of  Appendix 
B,  which  is  a  distillation  of  the  issues  generated  by  these 
proposed  systems. 




System  A — MDE  publishes   a   single   short    test 

How  system  would   work.      MDE  would   put   together   a   test   of   about    50   items 
in   each  content   area.      The  difficulties   of    the   items  would   run  over   a 
relatively   small  range,   and  be  fairly   easy  for   the  general  population,    say 
.75   -    .95.      MDE  would   norm   the   test   on  a  representative   sample  of    students 
in  the  state.      MDE  would  maintain   securely  a  parallel   50   item  test  on  which 
norms  also  were  obtained. 
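
For readers unfamiliar with the statistics named here, the following is a
minimal sketch of how item difficulty (the proportion of students answering
correctly) and the point-biserial (the correlation between an item score and
the total score) might be computed from a 0/1 response matrix. The function
names and the sample data are illustrative assumptions, not MDE procedures.

```python
from statistics import mean, pstdev


def item_difficulty(responses, item):
    """Proportion of students who answered the item correctly."""
    return mean(row[item] for row in responses)


def point_biserial(responses, item):
    """Correlation between the 0/1 item score and the total test score.

    The total here includes the item itself (an uncorrected point-biserial).
    """
    item_scores = [row[item] for row in responses]
    totals = [sum(row) for row in responses]
    sd_item, sd_total = pstdev(item_scores), pstdev(totals)
    if sd_item == 0 or sd_total == 0:
        return 0.0  # no variation, so the correlation is undefined
    m_item, m_total = mean(item_scores), mean(totals)
    cov = mean((i - m_item) * (t - m_total)
               for i, t in zip(item_scores, totals))
    return cov / (sd_item * sd_total)


def in_target_range(p, low=0.75, high=0.95):
    """True if a difficulty falls in the .75 - .95 range suggested above."""
    return low <= p <= high


# Rows are students, columns are items (1 = correct, 0 = incorrect).
responses = [[1, 1, 0], [1, 0, 0], [1, 1, 1], [0, 1, 0]]
difficulty_first_item = item_difficulty(responses, 0)
```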

Each  year,   MDE  would  develop  a  parallel   test.      This  parallel   test  might 
be  normed,    but  more  likely  would   be  equated   to  the  secure  form.      Every  few 
years    (say  5)    the  entire  process  would  be  redone,   with  a  new  secure  test 
being  developed  and  normed. 

How  system  would   be  used   by  a  district  desiring  minimal    involvement. 
Such  a  district   would   determine  a   cut-off    level   by  using   any   or   all   of    several 
pieces   of    information: 

1.  Statewide  normative  data   on  total  test   scores. 

2.  Published  difficulty   levels   on   individual    items. 

3.  Judgments  about  difficulty  levels. 

4.  Local  performance   on  the  state  test   in  previous  years. 

Testing   from  this   program  could   be  used   to  meet   the  testing  and  reporting 
requirements  of   the  Basic   Skills  Improvement  Policy.      The  value  of   these 
results   to  assist   in  diagnosing   student  needs  and    in   identifying   curriculum 
areas  which  should   be  adjusted   or  modified    is   less  clear. 

How  system  could   be  modified   by  the  LEA  if   traditional   statistics  were 
supplied.  The LEA could replace individual items in the test with other items
having the same point-biserial and difficulty.  If only a small number were
exchanged (say, 10% or less), the normative data supplied for the original
test would be a close approximation to the true normative data of the revised
test.  To field test the items to be added, the LEA would have to get a
normative sample that resembled the Massachusetts sample, not a representative
sample of the LEA.
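
The replacement rule just described can be sketched as a simple check. This
is an illustration under stated assumptions: the text asks for the "same"
statistics, so the tolerance of 0.02 used below is a hypothetical stand-in
for how close "same" would need to be, and the function name is invented.

```python
def can_keep_original_norms(test, swaps, max_fraction=0.10, tol=0.02):
    """Check whether a revised test may still use the original norms.

    test:  dict mapping item id -> (difficulty, point_biserial)
    swaps: dict mapping replaced item id -> (difficulty, point_biserial)
           of the new item, as measured on a Massachusetts-like sample
    """
    # No more than about 10% of the items may be exchanged.
    if len(swaps) > max_fraction * len(test):
        return False
    # Each new item must match the statistics of the item it replaces.
    for old_id, (new_p, new_r) in swaps.items():
        old_p, old_r = test[old_id]
        if abs(new_p - old_p) > tol or abs(new_r - old_r) > tol:
            return False
    return True


# A 50-item test in which every item has the same assumed statistics.
test = {i: (0.80, 0.40) for i in range(50)}
five_close_swaps = {i: (0.81, 0.39) for i in range(5)}
six_close_swaps = {i: (0.81, 0.39) for i in range(6)}
one_poor_swap = {0: (0.60, 0.40)}
```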

Additional  ways   system  could   be  modified   by   the  LEA   if   Rasch  statistics 
were supplied.  The LEA could replace individual items in the state
test, or add new items to it, as appropriate.  There would be no restriction
on  the  difficulty  of    these  new  items,    so   that  an  LEA  could  make  the  test  more 
difficult   if    it  wanted   to  and   still  compare  its  results   to  Massachusetts  norms. 
The   items  would   have  to  fit   the  same  latent   traits  as  the  original   items.      The 
items  could   be  scaled   on  any  reasonable  sample  of   convenience.      The  LEA  would 
have  to  make  all  the  Rasch  calculations   on  their  new  items.      The  revised   test 
could   be  significantly   longer   than  the  original   test  and  contain  new  content 
areas,    so  long   as   those  new  areas  measured   the   same  latent   trait  as  the  old 
test.      Consequently,    the  test   could   be  used  for   student  diagnosis,    but  not 
curriculum  revision. 

The  LEA  could  reduce  administration   time  by  administering    to    students  only 
those  items   that  were  of   appropriate  difficulty.      If    it   chose  to  administer  the 
same  test   to  all   students,    it   could   construct   its  test  by  selecting   the  items 
that  had   the  most  appropriate  difficulty   levels  for   their   students. 
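
To illustrate why Rasch statistics would permit comparison to state norms
even when the local test is harder, the sketch below uses the standard Rasch
model, in which the probability of a correct answer depends only on the
difference between student ability and item difficulty. It assumes all item
difficulties have already been calibrated on one common scale; the function
names and difficulty values are hypothetical.

```python
import math


def p_correct(theta, b):
    """Rasch probability that a student of ability theta answers an item
    of difficulty b correctly."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def expected_score(theta, difficulties):
    """Expected number-correct score on a set of items."""
    return sum(p_correct(theta, b) for b in difficulties)


def ability_from_raw_score(raw, difficulties, lo=-10.0, hi=10.0, iters=60):
    """Find, by bisection, the ability whose expected score equals raw.

    Perfect and zero scores have no finite ability estimate and are rejected.
    """
    if not 0 < raw < len(difficulties):
        raise ValueError("raw score must lie strictly between 0 and the test length")
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if expected_score(mid, difficulties) < raw:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# A raw score of 2 on a harder local form, restated on the state form.
state_form = [-1.5, -1.0, -0.5, 0.0]   # assumed difficulties (logits)
local_form = [0.0, 0.5, 1.0, 1.5]      # harder items, same latent trait
theta = ability_from_raw_score(2, local_form)
state_equivalent = expected_score(theta, state_form)
```

The same raw score corresponds to a higher state-form score because the local
items are harder, which is the adjustment that lets the LEA compare its
results to Massachusetts norms.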




System  B — MDE  publishes  a  single  long  test 

How system would work.  This system would operate much as System A,
where MDE puts together a test of about 50 items in each content area, except
that the test would be much longer, say 200 items.  The longer test would be
constructed  with  two  major  differences  from  the  shorter  test:   the  items 
would  range  far  more  widely  in  difficulty,  and  cover  more  content  areas  in 
greater  depth.   This  longer  test  probably  would  have  the  shorter  test  described 
in  System  A  (50  items)  imbedded  in  it  as  a  subtest  of  items.   The  longer  test 
would  be  constructed  by  simply  adding  an  additional  150  items,  most  of  which 
would  be  more  difficult.   The  content  areas  which  were  measured  by  only  a  few 
items  in  the  shorter  test  would  be  filled  out,  and  some  content  areas  left  out 
in  the  shorter  test  would  be  added. 

As  in  System  A,  MDE  would  norm  this  test  on  a  representative  sample  of 
students  in  the  state.   MDE  also  would  maintain  a  secure  set  of  items,  but 
not  necessarily  a  full  200  item  set.   MDE  probably  would  want  to  develop  a 
parallel  test  each  year,  although  the  need  for  publishing  a  new  test  annually 
would  not  be  as  strong  with  the  longer  test  as  it  would  be  with  a  shorter  one. 
As  in  System  A,  the  entire  process  would  need  to  be  restarted  every  5  years  or 
so. 

How  system  would  be  used  by  a  district  desiring  minimal  involvement.   The 
district  would  determine  a  cut-off  level  much  as  it  did  in  System  A,  by  using 
any  or  all  of  the  following  types  of  data: 

1.  Statewide  normative  data  on  total  test  scores. 

2.  Published  difficulty  levels  on  individual  items. 

3.  Judgments  about  difficulty  levels. 

4.  Local  performance  on  the  state  test  in  previous  years. 

The  only  problem  here  is  that  if  the  LEA  chose  to  establish  a  cut-off  level 
by  reviewing  individual  items,  there  would  be  four  times  as  many  to  review. 
The  advantage  of  the  longer  test  is  that  the  results  could  be  used  to  assist 
in  diagnosing  student  needs  and  in  identifying  curriculum  areas  which  should 
be  adjusted  or  modified,  as  well  as  to  meet  the  testing  and  reporting  require- 
ment. 

How  system  could  be  modified  by  the  LEA  if  traditional  statistics  were 
supplied.  LEAs could replace individual items in the test with the same
constraints as in System A:

1.  The number of replacements could be not more than, say, 10%.

2.  The new items would have to have the same difficulty and point-biserials
as the items they were replacing.

3.  The norming on these new items would have to be done on a sample of
students that resembled the Massachusetts sample.

Since the test is longer in System B (200 items) than in System A (50
items), the LEA has much more choice of items to replace.  On the other hand,
with a longer test, they are likely to find more items they will want to replace.



Additional ways system could be modified by the LEA if Rasch statistics
also  were  supplied.   As  in  System  A,  the  LEA  could  make  significant  revisions 
to  the  test.   However,  since  additional  curriculum  areas  would  be  included 
in  the  longer  test,  the  LEA  could  use  their  revised  or  expanded  edition  of 
the  test  for  curriculum  revision,  as  well  as  diagnosis. 

System  C — MDE  publishes  multiple  forms  of  a  short  test 

How  system  would  work.   MDE  would  construct  between  three  and  five  tests 
of  approximately  50  items  each.   Each  of  these  tests  would  be  similar  to  the 
one  described  in  System  A;  i.e.,  the  difficulties  of  the  items  would  run  over 
a  relatively  small  range,  and  be  fairly  easy  for  the  general  population,  say, 
.75  -  .95.   They  would  be  constructed  to  be  parallel  to  each  other.  MDE  would 
either  simultaneously  norm  the  tests  on  representative  samples  of  students  in 
the  state,  or  would  norm  one  of  the  tests  and  equate  the  others  to  it.   One 
additional  parallel  test  would  remain  secure. 

Each  year,  MDE  would  remove  one  of  the  tests  from  circulation  and  add 
a  new  test  which  would  be  equated  to  the  secure  test.   Every  5  years  or  so, 
a new secure test would be developed and normed.

How  system  would  be  used  by  a  district  desiring  minimal  involvement. 
Districts  desiring  minimal  involvement  would  select  one  of  the  tests  for  use 
each  year,  typically  by  a  representative  advisory  panel.   As  in  System  A,  a 
cut-off  level  would  be  determined  by  using  any  or  all  of  several  pieces  of 
information: 

1.  Statewide  normative  data  on  total  test  scores. 

2.  Published  difficulty  levels  on  individual  items. 

3.  Judgments  about  difficulty  levels. 

4.  Local  performance  on  the  state  test  in  previous  years. 

The  results  could  be  used  to  meet  only  the  testing  and  reporting  requirements 
of  the  Basic  Skills  Policy. 

How  system  could  be  modified  by  the  LEA  if  traditional  statistics  were 
supplied.  LEAs could put together their own test using either of two methods:

1.  By  substituting  items  in  the  form  they  have  chosen  as  in  System  A; 
i.e.,  they  could  replace  individual  items  in  the  test  with  other 
items  and  still  use  the  original  normative  data,  providing  the 
replacement  items  had  the  same  point-biserial  and  difficulty  as  the 
originals, and if only a small number were exchanged (say, 10% or less).

2.  By  inserting  parallel  items  from  another  form.   In  using  this  latter 
approach,  the  LEA  would  be  envisioning  the  multiple  forms  of  the  test 
as a stratified item pool: 50 sets of items with 3-5 items in each
set.   If  the  items  truly  are  parallel,  the  LEA  could  construct  a 
parallel  form  of  its  choice  by  selecting  any  one  item  from  each  of 
the  sets.   The  advantage  of  this  approach  as  opposed  to  that  outlined 
in  System  A  is  that  the  LEA  would  neither  have  to  write  new  items 
that  had  matching  statistics  nor,  more  importantly,  have  to  norm  them 
on  a  sample  representative  of  Massachusetts. 
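
The second method can be sketched as follows, assuming the published forms
are arranged so that the items in each set are truly parallel. The item
labels and function names are invented for illustration.

```python
import random


def build_form(item_sets, seed=None):
    """Construct a form by drawing exactly one item from each parallel set."""
    rng = random.Random(seed)
    return [rng.choice(items) for items in item_sets]


def count_possible_forms(item_sets):
    """Number of distinct forms obtainable from the stratified pool."""
    total = 1
    for items in item_sets:
        total *= len(items)
    return total


# Three published forms of 50 items each, viewed as 50 sets of 3 items.
item_sets = [[f"Form{f}-Item{q}" for f in (1, 2, 3)] for q in range(1, 51)]
local_form = build_form(item_sets, seed=1)
possible_forms = count_possible_forms(item_sets)
```

Even with only three published forms, the number of forms an LEA could
assemble this way is 3 raised to the 50th power, without writing or norming
a single new item.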




Additional  ways    system   could   be  modified   by   the  LEA   if   Rasch   statistics 
also were supplied.  There are no practical differences from System A.  The
LEA could replace individual items or add items.  There would be no
restriction on the difficulty of these items, so that an LEA could make the test more
difficult if it wanted to and still compare its results to Massachusetts norms.
The items would have to fit the same latent traits as the original items.  The
items  could   be  scaled   on  any  reasonable   sample  of   convenience.      The  LEA  would 
have  to  make  all   the  Rasch  calculations  on  their  new  items.      The  revised   test 
could  be   significantly  longer   than  the  original   test  and   contain  new  content 
areas,    so   long   as   those  new  areas  measured   the   same  latent   trait   as   the  old 
test.      Consequently,    the  test  could   be  used  for    student  diagnosis,    but  not 
curriculum  revision. 

The  LEA  could  reduce  administration  time  by  administering    items   to   stu- 
dents  that  were  of   appropriate  difficulty.      If    it  chose  to  administer   the 
same  items  to  all   students,    it   could   construct   its   test   by  selecting    those 
items   that   had   the  most  appropriate  difficulty  levels  for   their   students. 

System  D — MDE  publishes  multiple  forms  of   a  long   test 

How  system  would  work.      MDE  would   construct   between  three  and   five  forms 
of   a   test    similar   to   the  one  described    in  System  B.      This  would   be  a   test   of 
perhaps   200   items,    having   two  major  differences  from  the   shorter   test   of    50 
items:      the  items  would   range  far  more  widely   in  difficulty,    and   cover  more 
content areas in greater depth.  As in System C, MDE would construct an
additional parallel form which would be kept secure.  Again, either the forms
would all be normed, or the secure form would be normed and the others equated
to it.

There  probably  would  not   be  a  need   to  remove  one  form  from  circulation 
every  year,    but   the  forms  would   need   to  be  revised   on  some  sort   of   continuing 
basis.      As   in  the  earlier   system,    the  secure  test  would   be  revised   and  re- 
normed   every  5  years. 

How  system  would  be  used  by  a  district  desiring  minimal  involvement.  Such 
a  district  would  proceed  as  in  System  A  by  determining  a  cut-off  level  by  using 
any  or   all  of    several  pieces  of    information: 

1.  Statewide  normative  data  on  total  test    scores. 

2.  Published  difficulty   levels  on   individual   items. 

3.  Judgments  about  difficulty  levels. 

4.  Local  performance  on  the  state  test   in  previous  years. 

The  cut-off    level  would   be  determined  much  the  same  way  as  for   System  B. 
As  was  true  in  that   system,    the  problem  here   is  that    if   the  LEA  chooses   to 
establish  a  cut-off    level  by  reviewing    individual   items,    there  are  four   times 
as  many  to  review  as  there  would   be  for   the  shorter   test.      The  advantage  of 
the  longer   test    is  that   the  results  could   be  used   to  assist    in  diagnosing    stu- 
dent needs  and   in   identifying   curriculum  areas  which  should   be  adjusted   or 
modified,    as  well  as   to  meet   the  testing   and   reporting   requirements    established 
by  the  Basic   Skills  Policy. 




As  in  System  C,  districts  desiring  minimal  involvement  would  select 
one  of  the  tests  for  use  each  year,  typically  by  a  representative  advisory 
panel. 

How  system  could  be  modified  by  the  LEA  if  traditional  statistics  were 
supplied.  As in System C, LEAs could put together their own test by
substituting items in any one form.  Also, since the items in the different
forms would be equivalent, a district could construct a new form by
selecting one item from each of the 200 sets of 3-5 items.  Again, the advantage
of  this  approach  is  that  the  LEA  would  not  have  to  construct  new  items  or 
norm  them  in  order  to  be  able  to  construct  a  new  form  and  interpret  the 
results  arising  from  its  use. 

Additional  ways  system  could  be  modified  by  the  LEA  if  Rasch  statistics 
also  were  supplied.   There  are  no  practical  differences  from  System  B,  in 
that  the  LEA  could  make  significant  revisions  to  the  test.   Since  additional 
curriculum  areas  would  be  included  in  the  longer  test,  the  LEA  could  use 
their  revised  or  expanded  edition  of  the  test  for  curriculum  revision,  as 
well  as  diagnosis. 

System  E — MDE  publishes  a  short  test  and  a  long  test 

How  system  would  work.   MDE  would  construct  a  long  test  (say  200  items) 
as  well  as  a  short  test,  which  probably  would  be  merely  a  subset  of  the  200 
items  in  the  longer  test.   The  tests  would  be  like  those  described  in  System 
A  and  B;  i.e.,  the  difficulties  of  the  items  in  the  short  test  would  run  over 
a  relatively  small  range,  and  be  fairly  easy  for  the  general  population,  say 
.75  -  .95.   The  longer  test  (with  perhaps  200  items),  would  be  constructed 
with  two  major  differences  from  the  shorter  test:   the  items  would  range  far 
more  widely  in  difficulty,  and  cover  more  content  areas  in  greater  depth. 
Norms  would  be  obtained  for  both  tests.   Another  parallel  long  test  would  be 
constructed  and  maintained  as  a  secure  form. 

New  tests  probably  would  be  developed  each  year  although  as  pointed  out 
in  the  discussion  of  System  B,  it  might  not  be  necessary  to  do  so.   A  new 
secure  form  would  be  constructed  and  normed  every  5  years  or  so. 

How  system  would  be  used  by  a  district  desiring  minimal  involvement.   The 
short  form  would  be  used  as  described  in  System  A,  where  the  districts  would 
determine  a  cut-off  level  by  using  any  or  all  of  several  pieces  of  information: 

1.  Statewide  normative  data  on  total  test  scores. 

2.  Published  difficulty  levels  on  individual  items. 

3.  Judgments  about  difficulty  levels. 

4.  Local  performance  on  the  state  test  in  previous  years. 

Testing  from  this  program  could  be  used  to  meet  the  testing  and  reporting 
requirements  of  the  Basic  Skills  Improvement  Policy,  as  well  as  to  assist  in 
diagnosing  student  needs  and  in  identifying  curriculum  areas  which  should  be 
adjusted  or  modified. 




How  system  could  be  modified  by  the  LEA  if  traditional  statistics  were 
supplied.  If the LEA wanted to modify either test, it would be done by
replacing individual items in the test with other items having the same
point-biserial and difficulty.  So long as only a small number were exchanged (say, 10%
or less), the normative data supplied for the original test would be a close
approximation  to  the  correct  normative  data  for  the  revised  test.   To  test  the 
items  to  be  added,  the  LEA  would  have  to  get  a  normative  sample  that  resembled 
the  Massachusetts  sample,  not  a  representative  sample  of  the  LEA. 

Additional  ways  system  could  be  modified  by  the  LEA  if  Rasch  statistics 
also were supplied.  The forms would be modified as described in Systems A
and B; i.e., the LEA could replace individual items or add items.  There would
be  no  restriction  on  the  difficulty  of  these  new  items,  so  that  an  LEA  could 
make  the  test  more  difficult  if  it  wanted  to  and  still  compare  its  results  to 
Massachusetts  norms.   The  items  would  have  to  fit  the  same  latent  traits  as 
the  original  items.   The  items  could  be  scaled  on  any  reasonable  sample  of 
convenience.   The  LEA  would  have  to  make  all  the  Rasch  calculations  on  their 
new  items.   The  revised  test  could  be  significantly  longer  than  the  original 
test  and  contain  new  content  areas,  so  long  as  those  new  areas  measured  the 
same  latent  trait  as  the  old  test. 

The LEA could reduce administration time by administering to students
only those items that were of appropriate difficulty.  If it chose to administer the
same items to all students, it could select those items that have appropriate
difficulty levels for their students.

System  F — MDE  publishes  multiple  forms  of  both  the  long  test  and  the  short 
test.

How  system  would  work.  MDE  would  construct  between  three  and  five  forms 
of  a  test  of  perhaps  200  items,  similar  to  the  one  described  in  System  B. 
Shorter  tests,  like  the  ones  described  in  Systems  A  and  C  (50  items),  would  be 
derived  from  the  longer  tests.   Another  longer  test  would  be  developed  and 
maintained  securely.   The  tests  would  be  normed,  probably  by  norming  one  of 
the  long  forms  and  equating  all  the  others  to  it. 

With  so  many  tests  available,  it  is  difficult  to  predict  what  rotation 
schedule  MDE  would  find  necessary.   It  seems  likely  that  the  whole  system  could 
stay  in  place  for  3  to  5  years.   Rather  than  revising  the  whole  system  every 
3-5  years,  however,  it  might  be  better  to  introduce  one  new  form  and  phase  out 
the  oldest  form  every  year. 

How  system  would  be  used  by  a  district  desiring  minimal  involvement.   As 
in  System  C,  districts  desiring  minimal  involvement  would  select  one  of  the 
tests  for  use  each  year,  typically  by  a  representative  advisory  panel.   A  cut- 
off level  would  be  determined,  and  results  would  be  used  to  meet  testing  and 
reporting requirements.

How  system  could  be  modified  by  the  LEA  if  traditional  statistics  were 
supplied.  As in System C, LEAs could put together their own test by
substituting items in any one form in the test with other items having the same
point-biserial and difficulty if only a small number were exchanged (say, 10%
or less).  Also, if the items in the different forms were equivalent, a
district  could  construct  a  new  form  by  selecting  one  item  from  each  of  the 
200  sets  of  3-5  items. 

Additional  ways  system  could  be  modified  by  the  LEA  if  Rasch  statistics 
also  were  supplied.  With  this  many  items  available  for  use,  it  is  unlikely 
that  many  districts  would  want  to  construct  additional  new  ones.   However,  if 
they  did,  they  could  include  them  in  the  test  using  the  process  described  in 
the  other  systems.   They  also  would  have  the  flexibility  of  dividing  up  the 
forms  and  constructing  new  ones.   The  new  forms  could  be  more  or  less  difficult 
than the original ones, and contain fewer subcontent areas.

System  G — MDE  publishes  a  pool  of  items 

How  system  would  work.  MDE  would  publish  a  large  pool  of  items.   The 
pool  most  likely  would  be  a  collection  of  items  from  various  sources — some 
written  by  MDE  specifically  for  this  application,  some  obtained  from  old  tests, 
others  taken  from  pools  developed  by  other  agencies.   Ideally,  each  item  in 
the  pool  would  be  referenced  to  an  objective  in  the  Basic  Skills  Improvement 
Policy.   The  pool  would  be  so  large  as  to  make  security  a  minimal  issue,  since 
it  can  be  argued  reasonably  that  anyone  who  can  master  all  the  items  in  the 
pool has reached a level of educational competence.

In  addition  to  the  pool,  MDE  would  maintain  a  secure  set  of  items  that 
would  be  similar  to  the  longer  tests  described  in  System  B.   These  items  would 
be  used  to  evaluate  the  impact  of  the  Basic  Skills  Improvement  Policy  and 
examine  longitudinal  trends  in  the  state. 

How  system  would  be  used  by  a  district  desiring  minimal  involvement. 
Districts  would  develop  a  test  by  drawing  items  from  the  pool.   Typically, 
this  procedure  would  be  carried  out  by  an  advisory  committee  with  broad  repre- 
sentation.  After  choosing  the  items,  a  cut-off  score  would  be  determined  much 
in the manner of other systems, using both normative data and internal judgments.
Testing  from  this  program  could  be  used  to  meet  all  the  requirements  of  the 
Basic  Skills  Improvement  Policy. 

A  small  modification  to  this  system  would  have  MDE  publishing,  along  with 
its  pool,  a  suggested  50  item  test  and  a  suggested  200  item  test.   All 
necessary  normative  computations  could  be  accomplished  by  the  state.   In  this 
modification, a district unwilling to or incapable of choosing its own test simply
could  adopt  one  of  the  two  suggested  state  tests.   The  state  would  reconstruct 
these  two  tests  each  year  by  drawing  another  set  of  items  from  the  pool.   In 
this  modification,  the  state  essentially  would  be  acting  as  a  model  district. 
LEAs  that  wished  to  follow  the  example  of  the  model  would  be  free  to  do  so. 

How  system  could  be  modified  by  the  LEA  if  traditional  statistics  were 
supplied.   If  so  desiring,  an  LEA  could  write  other  items  to  add  to  the  pool 
(the  state  might  be  wise  to  establish  a  system  for  collecting  these  items  and 
adding  them  to  its  pool).   At  this  point,  however,  a  district  would  be  develop- 
ing its  own  test  with  no  normative  data  available;  such  a  district  might  be 
better  off  constructing  its  own  locally-developed  test  from  scratch. 


Additional ways system could be modified by the LEA if Rasch statistics
also were supplied. A district could add additional items to the pool. With
Rasch statistics supplied, however, normative comparisons would still be
possible.

System H -- MDE publishes a single short test, but provides additional items on
request.

How  system  would  work.      MDE  would   put   together   a   test   of   about   50   items 
in  each  content  area.      The  test  would   be  just   like  the  one  described    in  System 
A.      However,    if   a  district  were  dissatisfied  with  an   item,    it   could   contact 
MDE  for  an  equivalent  replacement.      If    it   found   there  were  objectives   that   it 
wanted   to  measure  that  were  not   included   on  the  test,    it  would   request    items 
to  measure  those  objectives  from  MDE. 

The  Department  would    either  hire  its  own   item  writers   or  would   act  as  a 
broker,  maintaining   a  file  of  writers  for  different   content  areas.      As  requests 
came  in,   MDE  would  put   the  writers   in  contact  with  the  requesting  districts. 

How system would be used by a district desiring minimal involvement.
Unless a district wanted additional items written, it would operate as described
in  System  A. 

How  system  could   be  modified   by  the  LEA   if    traditional   statistics  were 
supplied. The test could be modified by addition or replacement, using the
items  requested  from  MDE.      An  LEA  would   have  to  complete  the  same  analyses 
as  those  described    in  System  A. 

Additional  ways   system  could   be  modified   by  the  LEA  if   Rasch  Statistics 
also were supplied. Again, this system would operate identically to System A.
The  advantage  of   having   Rasch  statistics   is  that  new  items  could   be  added  with- 
out renorming   the  test. 

System  I — MDE  publishes  a   single  long   test,    but  provides  additional   items  on 
request.

How  system  would  work.      MDE  would   put   together   a   test   of   perhaps   200   items, 
as   in  System  B.      However,    if   a  district  were  dissatisfied  with  an   item,    it 
could  contact  MDE  for  an  equivalent  replacement.      If    it  found   there  were  objec- 
tives  that    it  wanted   to  measure  that  were  not    included   on  the  test,    it  would 
request   items  to  measure  those  objectives  from  MDE. 

As  in  System  H,  the  Department  would  fulfill  its  responsibility  either  by 
hiring    its   own  consultants  to  write  the   items  or   by   serving  as  a  broker. 

How  system  would  be  used  by  a  district  desiring  minimal  involvement.  Un- 
less a  district  wanted  additional  items  written,  it  would  operate  as  described 
in System B.


How  system  could   be  modified   by   the  LEA   if    traditional   statistics 
were   supplied.      As   in   System  H,    the  test   could   be  modified   by  replacement 
or   addition,   using    the   items  requested   from  MDE.      Again,    an  LEA  would   have 
to   complete  the   same  analyses   as   those  described    in  System  A. 

Additional  ways   system   could   be  modified   by   the  LEA   if   Rasch   statistics 
also  were   supplied.      As  pointed   out    in  System  A,   new  items  could  be  readily 
inserted into the test without going through the process of renorming.

System  J — MDE  publishes   separate  tests  for  each  district. 

How  system  would  work.      Under   this   system,   MDE  would   supply  each  district 
with  a  catalog   of   objectives,    or  alternatively,   with  a  complete  listing   of   all 
the  items  available   in  an   item  pool.      Each  district  would  determine  what   it 
wanted    its   test   to   look  like — how  many   items,    how  difficult,    how  many   items 
for   each  objective,    perhaps   even  which  items   it  wanted  chosen  for    its   test — 
and   submit  an  order   to  a  central   test  production  facility.      The  center  would 
produce  a  copy  of    the  test   as  requested,    and   provide  that   back  to  the  district. 
An  array  of    statistics  could   be  provided   back  with  the  test:      how  it   compared 
to   the  test   the  district  had  ordered   the  previous  year;   what   the  cut-off 
score  on  this   test   should   be  to  make   it   equivalent   to   the  cut-off    score  on  the 
district's  previous  test;    what   percentage  of    students   statewide  might   be 
expected   to  achieve  a   score  below  any  given  raw  score  on  the  test;    how  students 
throughout the state would score on that district's test, objective by objective;
and just about any other comparative result that one could think of.
Thus,    one  major  feature  of   a  computerized   system   is  not  only  the  way  the  tests 
can  be  put   together,    but  also   the  statistics   that   can  be  supplied   along  with 
the  test. 

A  simple  extension  of    this  concept  would   tie  in  a   scoring   service  to   the 
test  construction   service.      As   the  test    is  constructed,    a  record    is  made  of 
the  items  used   and   their  correct  responses.     When  the  answer   sheets  are  returned, 
the  tests  can  be  scored,   and  a  variety  of    information,    both  criterion-referenced 
and  normative,   can  be  returned  with  the  test  results.      There  will  be  no  attempt 
here  to  detail  all  the  ways  that  results   could   be  reported;    suffice  to   say 
that  virtually  any  format  and   selection  of    statistics  could  be  used. 
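
The scoring service described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names, the item identifiers, and the objective labels are hypothetical and do not come from the study. The first function uses the answer key recorded when the test was constructed to produce a raw score and an objective-by-objective (criterion-referenced) tally; the second gives one possible normative figure, the percentage of a reference group scoring below a given raw score.

```python
from collections import defaultdict


def score_answer_sheet(answer_key, item_objectives, responses):
    """Score one answer sheet against the key recorded at test construction.

    answer_key:      dict of item id -> correct option (hypothetical format)
    item_objectives: dict of item id -> objective label
    responses:       dict of item id -> option marked; omitted items are absent
    Returns (raw_score, per_objective), where per_objective maps each
    objective to a (number correct, number of items) pair.
    """
    raw_score = 0
    correct = defaultdict(int)
    total = defaultdict(int)
    for item_id, key in answer_key.items():
        objective = item_objectives[item_id]
        total[objective] += 1
        if responses.get(item_id) == key:
            raw_score += 1
            correct[objective] += 1
    per_objective = {obj: (correct[obj], total[obj]) for obj in total}
    return raw_score, per_objective


def percent_scoring_below(raw_score, reference_scores):
    """Percentage of a reference group scoring strictly below raw_score."""
    if not reference_scores:
        raise ValueError("reference_scores must not be empty")
    below = sum(1 for score in reference_scores if score < raw_score)
    return 100.0 * below / len(reference_scores)
```

A district report could combine the two results, for example by listing each student's raw score, the per-objective tallies, and the statewide percentage falling below that raw score.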

How  system  would   be  used   by  a  district  desiring  minimal   involvement.      The 
system  would  be  used   the  same  way  by  all  districts.      Each  would   order  a   test, 
receive  back  a  camera-ready  copy  and  have  their  booklets  printed. 

How  system  could   be  modified  by  the  LEA  if   traditional   statistics  were 
supplied. As noted above, the LEA would have no reason to modify the system.
The  tests   supplied  would  be  designed   to  be  just  what   the  district  wanted. 


APPENDIX  B 

Issues  in  Choosing  a  System 

After  RMC  staff  had  written  Appendix  A,  it  was  clear  that 
there  were  several  issues  that  would  need  to  be  discussed  with 
the  eight  school  districts  that  were  going  to  be  visited. 
Consequently,  before  those  meetings  began  in  November,  1979,  a 
paper  on  the  issues  identified  to  that  point  was  written  and 
sent  to  the  participating  districts.   This  appendix  is  a  revision 
of  that  paper. 


Issues  Related  to  the  Dissemination 

of  Tests  by  the  Massachusetts  Department  of  Education 

for  the  Basic  Skills  Improvement  Policy 

The  Massachusetts  State  Board  of  Education  has  adopted  a  Basic  Skills 
Improvement  Policy.   Under  this  policy,  all  school  districts  in  Massachusetts 
must  initiate  an  evaluation  program  to  measure  the  extent  to  which  their 
students  have  mastered  the  basic  skills  of  reading,  writing,  listening, 
speaking,  and  mathematics.   The  evaluation  results  must  be  reported  to  the 
public  and  may  be  used  to  assist  in  diagnosing  student  needs,  and  to  identify 
curriculum  areas  which  should  be  adjusted  or  modified.   At  the  present  time, 
passing  these  tests  is  not  intended  for  use  as  a  graduation  requirement. 

Each  school  district  has  the  responsibility  of  assessing  its  students 
at  three  levels:   early  elementary,  late  elementary  and  secondary.   At  the 
secondary  level,  school  committees  have  three  options  from  which  to  choose: 
(1) they can develop their own test; (2) they can purchase an approved commercially
developed test; or (3) they can use a test provided by the Department
of  Education. 

The  Department  of  Education  currently  is  in  the  process  of  determining 
the  best  way(s)  to  provide  the  secondary  tests  to  school  districts  in  the 
future.   The  purpose  of  this  paper  is  to  detail  several  of  the  issues  in- 
volved in  providing  multiple  choice  tests.   This  paper  will  be  used  as  a 
beginning  point  for  discussion  with  representative  school  districts  around 
Massachusetts,  and  will  be  revised  as  their  input  is  received. 

ISSUE #1
SHOULD  THE  DEPARTMENT  RELEASE  A  SHORT  TEST  OR  A  LONG  TEST,  OR  BOTH? 

Background 

The  original  intent  of  the  Department  was  to  produce  four,  50  to  60 
item  multiple-choice  tests  (a  state  test  and  three  equivalent  forms)  in  the 
areas  of  reading  and  mathematics  for  use  at  the  secondary  level.  Within  each 
content  area,  the  four  forms  would  be  equivalent;  i.e.,  they  would  all  measure 
the same objectives in the same proportions, and they would be equally difficult.
An alternative to be considered (in lieu of or in addition to the current
state  test)  is  the  production  of  a  longer  test  with  alternative  forms.   This 
longer  test,  of  say  200  items,  would  measure  the  same  basic  objectives  as  the 
shorter  test;  there  simply  would  be  more  items  per  objective. 

Advantages  of  Releasing  a  Longer  Test 

1.   Greater  coverage  of  Basic  Skills  Improvement  Policy  requirements.   As 
mentioned  earlier,  school  districts  have  several  tasks  to  accomplish  under 
the  policy.   They  must  determine  the  number  and  percentage  of  students  who  have 
exceeded  the  minimum  standards  established,  and  report  that  information  to  the 
public. Two other tasks that may be undertaken are diagnosis of individual
students' needs and evaluation of the local curriculum.


A  60-item  test    is   of    sufficient   length  only   to  accomplish  the  first 
task  well.      Consequently,    if    the  Department    issued   only   the   shorter   test, 
districts  would   have  to  make  their   own  arrangements  for  addressing   the 
other   activities   to   take  place  under    the  Policy. 

Diagnosis   of    students  who  fail   to  meet   the  minimum   standards  cannot 
be  accomplished   by  a   60-item  test  designed   for   the   general   student 
population.      Diagnosing   the  needs   of   these  students   is  a  much  more  complex 
task,   requiring   not   only   increased   content  coverage  but  also  a  difficulty 
level  that   is   inappropriate  for   the  general  population.      In  fact,    this 
aspect  of   the  policy   is   so  unique  that   it  can  be  argued   effectively   that 
diagnosis   should   be  an  entirely  separate  testing   program— only  those  stu- 
dents who  have  failed   to  meet   the  minimum  standards   on  the  general   test 
should   ever   be  administered   the  diagnostic   test. 

Similarly,    curriculum  assessment  cannot   be  accomplished    by  a   50-item 
test. Mathematics, for example, has 34 objectives, and each of the objectives
requires several sequences of instruction before mastery of the objectives
can be expected. A 60-item test would, at best, provide only a fragmentary
look at the effectiveness of the local curriculum.

A 200-item test would probably be constructed in three steps. A core
of 50 items would be used as the basis for determining whether students
met the minimum standard. In order to assure that pass/fail decisions would
be made as accurately as possible, these questions would be fairly difficult
for someone who was right on the borderline of passing and failing. These
50 core items, of course, would be quite easy for the general student population.
Another 100 items or so, more difficult than the core set, would
be used to assess the curriculum. Finally, a set of 50 items, easier than
those in the core, would be held in reserve. These questions would be
given  to  those  who  failed   the  standard    established   by  the  first   150  questions, 
and  would   be  used   to  help  plan  a  course  of   action  for   these  students   so   that 
they  would   be  likely   to  pass  the  standard   in  the  next   testing    session. 

2.      Less  concern  over   test   security.      The  Basic   Skills   Improvement  Policy 
mandates   that  ".    .    .each  public   school  district   shall  give   the   student   and 
his or her parent the opportunity to review the evaluation instrument used . . ."
Students  who  had   taken  the  test  could  request   to   see  it   again,   and  while 
reviewing    it,  memorize  a  few  of   the  questions.      Consequently,    if  a   shorter 
test  were  provided   by  the  Department,    it  would   take  only  a   short  while  before 
all   the  questions  and  answers  were  known  to  all   students.      A  longer   test 
would alleviate this problem in two ways. If a certain number of questions
became  public   knowledge,    they  would   be  a   smaller   percentage  of   a   longer    test; 
e.g.,    if    students  knew  20  questions  before  going    into   the  exam,    that  would 
comprise  40  percent   of   a  fifty-item  test,    but   only  10  percent  of   a   200-item 
test.      Also,     the  more   questions   that   students    learn   the   answers    to,    the  more   they 
should   know.      This  argument  will  be  carried   to  an   extreme  later   in  this  paper 
when   the  concept   of   releasing   item  pools   of   perhaps   1,500  questions   is  dis- 
cussed.     Some  would   argue  that   if    students  can   learn  1,500  facts  about  mathe- 
matics,   or  can  read   well   enough  to  memorize  1,500  questions  about  reading, 
they   in  fact  have  demonstrated   that   they  have  acquired  more  than  the  necessary 
minimum    of   basic    skills. 
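
The arithmetic behind the security argument can be checked with a short sketch. The function name is hypothetical, and treating a 1,500-item pool as if it were a single test is an assumption made here purely for comparison; the figures of 20 known questions, 50 items, and 200 items are the ones used in the paragraph above.

```python
def exposed_share(known_questions, test_length):
    """Percentage of a test made up of questions students already know."""
    if test_length <= 0:
        raise ValueError("test_length must be positive")
    # A student cannot know more of the test than the test contains.
    known = min(known_questions, test_length)
    return 100.0 * known / test_length


# Same 20 leaked questions, measured against tests of increasing length.
for length in (50, 200, 1500):
    share = exposed_share(20, length)
    print(f"{length} items: {share:.1f} percent known in advance")
```

The same number of leaked questions matters less and less as the length of the test, or the size of the pool it is drawn from, increases.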


Advantages  of  Releasing  a  Shorter  Test 

1.  Time.   A  shorter  test  takes  less  time  to  administer  and  score.   If 
the  Department  released  a  long  test  and  a  district  wanted  to  use  it  only  for 
making pass/fail decisions, it would have to spend time administering unnecessary
test questions. Of course, this objection could be overcome if the
Department  identified  the  50  or  so  items  that  comprised  the  pass/fail  core  of 
the  test.   Then  the  district  could  administer  only  those  items  to  its  students. 

2.  Developmental  costs.   A  shorter  test  requires  fewer  items  to  be 
written,  and  item  writing  is  an  expensive  undertaking.   This  advantage  is  some- 
what offset  by  the  fact  that,  because  the  test  questions  cannot  be  kept  completely 
secure,  the  fewer  questions  there  are  in  any  one  set,  the  more  often  alternate 
sets  will  need  to  be  released. 

3. Usage Costs. Local districts will be required to pay both for the
reproduction  and  scoring  of  the  tests  they  use.   Longer  tests  would  entail 
greater  costs  both  for  reproduction  and  scoring. 

Advantages  for  Releasing  Both 

1.   Choice.   As  mentioned  above,  if  the  Department  were  to  issue  a  long 
test,  it  could  inform  districts  which  subset  of  items  comprise  the  50-item 
pass/fail  core.   This  way,  districts  could  choose  the  test  they  wanted 
according  to  their  intended  use  of  it.   If  they  wanted  to  develop  their  own 
tests  for  curriculum  assessment  and  student  diagnosis,  they  could  select  out 
the  appropriate  50  items  and  administer  only  that  set.   If  they  did  not  want 
to  develop  their  own  tests,  they  could  use  the  complete  test  as  supplied  by 
the Department.

ISSUE  #2 
SHOULD  THE  DEPARTMENT  RELEASE  MULTIPLE  FORMS  OF  EACH  TEST? 

Background 

As  mentioned  earlier,  the  Department's  original  intent  was  to  provide 
three  alternative  forms  of  the  test  in  addition  to  the  original  form.   Each 
alternate  form  would  have  the  same  number  of  items,  measure  the  same  objec- 
tives in  the  same  proportions,  and  be  of  equal  difficulty  to  the  original  form. 

Advantages  of  Having  Multiple  Forms 

1.   Choice.   In  any  test  of  reasonable  length,  there  are  going  to  be 
test  questions  that  some  reviewers  feel  are  of  questionable  quality.   By 
providing  multiple  forms,  the  Department  would  be  offering  two  choices  to 
local  districts.   First,  the  district  could  review  all  the  forms  and  select 
the  one  it  felt  was  best.   Second,  if  that  best  form  were  still  short  of  a 
desired  quality  level,  the  district  could  replace  individual  items  in  the 
chosen  form  (subject  to  appropriate  technical  analyses)  with  their  equivalent 
counterparts  in  the  other  forms.   This  way,  the  district  could  construct  a 
test  with  which  it  is  more  likely  to  be  satisfied. 


2.   Multiple  administrations.   If  a  district  plans  multiple  adminis- 
trations throughout  the  year,  it  would  seem  necessary  that  multiple 
forms  be  available.   For  example,  suppose  a  district  tests  its  ninth  graders 
in  October,  then  retests  those  who  fail  in  January  and  then  again  in  May. 
Since  those  who  failed  the  test  are  entitled  to  see  it  on  request,  it  would 
seem  that  retesting  them  by  giving  them  the  same  form  of  the  test  they  had 
failed  earlier  would  result  in  passing  quite  a  few  students  who  had  in- 
sufficiently developed  basic  skills. 

Advantages  to  Having  a  Single  Form 

1.  Comparability.   No  two  forms  of  the  same  test  are  going  to  be 
perfectly equivalent. If one district gives Form A, and another Form B, to
some  extent  their  results  will  not  be  comparable.  While  it  is  likely  that 
these  differences  will  be  inconsequential,  there  always  is  the  possibility 
that the public will not perceive it that way. However, this perception
should be influenced by the fact that the administration of alternate
forms of standardized achievement tests is a common practice.

2.  Cost.   The  cost  of  writing  four  forms,  while  perhaps  not  quite  four 
times the cost of writing one form, can be considerably larger. Again,
however,  these  costs  will  be  offset  somewhat,  because  the  more  items  and 
forms  that  are  being  used,  the  less  frequently  they  must  be  replaced  with 
new ones.

ISSUE  #3 
SHOULD  THE  DEPARTMENT  RELEASE  ITS  TEST  AS  AN  ITEM  POOL? 

Background 

As discussed above, the original intent of the Department was to produce
50 to 60 item multiple-choice tests in mathematics and reading, along
with three equivalent forms of each test. School districts could use any
of the equivalent forms interchangeably. An alternative system being considered
is the development of an item pool. Under such a system, the Department
would make available to school districts a large pool of items (say,
between  500  and   1,500   items)    for   each  content  area.      Each  item  released 
would   be  related   to  an  objective  in  the  Basic   Skills   Improvement  Policy.      Each 
district  would  put   together   its  own  test  from  the  available  pool,    taking    into 
account:      1)    the  number  of    items   it  wanted   to   include  for   each  objective; 
2)    the  difficulty   of   the  items;    and  3)    its   own  personal  judgment  of    the 
quality   (clarity,    conciseness,    etc.)    of    the  available   items.      Unless  a  group 
of  districts  chose  to   select  a   set   of    items  jointly   in  a  cooperative  effort, 
it   is   likely  that   school  districts  would  develop  different   tests— all  measuring 
the  same  objectives,    but  constructed   of   different    items. 
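
One way a district might apply the three considerations just listed is sketched below. Everything here is an assumption made for illustration: the record layout, the use of the proportion of students answering correctly as the difficulty measure, and the rule of taking the first acceptable items in identifier order (a local committee would presumably choose among acceptable items by judgment instead).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PoolItem:
    item_id: str       # hypothetical identifier
    objective: str     # objective the item is referenced to
    difficulty: float  # assumed measure: proportion answering correctly


def assemble_test(pool, items_per_objective, min_p, max_p, rejected_ids=frozenset()):
    """Pick items from the pool for each objective a district wants covered.

    items_per_objective: dict of objective -> number of items wanted
    min_p, max_p:        acceptable difficulty range for this district
    rejected_ids:        items the local reviewers judged unsatisfactory
    """
    test = []
    for objective, wanted in items_per_objective.items():
        candidates = [
            item for item in pool
            if item.objective == objective
            and min_p <= item.difficulty <= max_p
            and item.item_id not in rejected_ids
        ]
        if len(candidates) < wanted:
            raise ValueError(f"not enough acceptable items for {objective}")
        # Placeholder rule: take the first acceptable items in id order.
        candidates.sort(key=lambda item: item.item_id)
        test.extend(candidates[:wanted])
    return test
```

Two districts calling such a routine with different counts, difficulty ranges, or rejection lists would end up with different tests that still measure the same objectives.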

Advantages  of   an  Item  Pool 

1.      Flexibility.      Each  district   could   choose  to   construct   its   test   as 
it    saw  fit.      If    it  viewed    one  objective  as   having  more   importance   than  another, 
it   could   construct  a   test   containing  more   items   that  measured   that   objective. 

Each  district   would   be  able   to  review  all   the   items    in   the  pool,    and    in- 
clude  in   its   tests   only   those   items    it   found    satisfactory.      With   the  Department 
issuing   a   test,    even  with   three   equivalent   forms,    such  choice  would   not   be 
available. 


A  district  could  develop  its  own  test  to  be  of  a  proper  difficulty 
level  for  its  students.   Districts  with  higher  standards  and/or  more 
capable  students  could  select  questions  that  were  more  difficult  and  more 
challenging for their students.

A  district  could  design  different  tests  for  the  many  requirements  of 
the Basic Skills Improvement Policy. It could select appropriate items
from  the  pool  for  a  minimum  skills  pass/fail  test,  another  set  for  curri- 
culum evaluation,  and  a  third  set  for  student  diagnosis. 

2.  Security.   As  mentioned  above,  it  can  be  argued  that  with  a  pool 
of  over  a  thousand  items,  security  no  longer  is  an  issue.   A  student  who 
could  memorize  all  the  answers  to  all  the  questions  would  be  demonstrating 
possession  of  minimum  basic  skills  by  the  very  nature  of  the  activity. 
The  larger  the  pool,  the  easier  the  argument  is  to  make.   Too,  the  pool 
would  not  be  made  public  all  at  one  time,  since  all  that  any  one  student 
would  be  entitled  to  see  would  be  the  items  he  or  she  took.   Thus,  it 
might be years before the pool was entirely disclosed; disclosure would
take  place  slowly  enough  so  that  the  Department  would  be  able  to  assess 
the impact that disclosure was having on the program. If disclosure reached a
level deemed to be too high, sections of the pool could be withdrawn and
new  items  written  to  replace  them. 

3.  Costs  of  item  writing.   Somewhat  paradoxically,  it  is  less  expensive 
to  produce  a  1,000  item  pool  than  it  is  to  write  a  200  item  test.   This 
occurs  because  a  200  item  test  must  be  carefully  written.   Each  item  is 
written  to  specific  objectives  and  standards,  field  tested,  reviewed  and  re- 
written as  necessary.   Since  the  items  are  fixed,  it  is  important  that  every 
item  be  perceived  as  being  fair  for  all  groups  of  students  across  the  state. 

In  a  pool,  on  the  other  hand,  districts  will  be  able  to  select  the  items 
that  appeal  to  them.  While  it  would  be  a  poor  practice  to  have  any  great 
number  of  items  that  are  perceived  as  poorly  written,  having  a  few  that  are 
judged  to  be  unfair  by  some  people  (perhaps  the  item  would  contain  some 
idiosyncratic  vocabulary)  is  not  a  major  problem — people  who  don't  like  them 
simply  can  ignore  them.   Since  many  pools  already  exist  which  are  free  to 
anyone  who  wants  to  use  them,  the  Department  could  select  a  couple  of  exist- 
ing pools,  remove  any  items  that  are  deemed  faulty  or  inappropriate  in  an 
initial  review,  code  the  remainder  to  the  Basic  Skills  Objectives,  review  the 
match  between  what  they  have  and  what  is  desired,  and  add  items  where 
necessary. 

Advantages of a Fixed Test

1.  Simplicity.   Each  district  would  be  presented  with  a  test  in  con- 
ventional format.   The  public  would  understand  such  a  system  more  readily 
than  it  would  an  item  pool.   District  staff  would  not  have  to  learn  how  to 
construct  a  test;  it  would  not  be  necessary  for  anyone  at  the  local  level  to 
be  involved  in  test  development  at  all. 

2.  Time  costs.   The  Department  would  be  the  only  agency  expending  time 
on  test  development.   Local  districts  that  chose  the  state  test  would  not 
spend time putting items together from a pool; items would already be combined
into  a  test.      If   the  Department    issued  multiple  forms  of   a   test,    some  dis- 
trict  time  would   be   spent   choosing   a  form,    and   perhaps   some  additional 
time  would   be  used   to  replace   individual   items   in  the   selected  forms,    but 
this   time  expenditure  would   be  far   less   than  that   required   to  put   together 
a   test  from  a  pool. 

3.  Security.      While  the  argument  was  made  previously   that   one  of   the 
advantages  of   an   item  pool  was   security,   a   counterargument   on  the  same   issue 
can  be  made  for  a  fixed   test.      A  pool  almost  certainly  would   have  to   be 
published   in  a  hard-copy  format— probably   in  a   loose-leaf   notebook.      Such  a 
publication could more readily be stolen than several single small test
booklets. Also, if a single test were published by the Department, it is
likely  that  a  new  test  would   be   issued   every  year.      The  problem  of    students 
passing   test   questions  on  to  other   students  would   thereby  be  eliminated. 

4.  Comparability  across  districts.      There   is  a   serious   question  as   to 
whether it is an advantage or disadvantage to be able to compare standards
on test results across districts. No matter what system is selected, it will
be  possible  to  make  at   least   some  comparisons  across  districts.      Nonetheless, 
if  all  districts  give  exactly  the  same  test,    inter-district   comparisons  will 
be  straightforward  and  readily  made. 

ISSUE  #4 
SHOULD  THE  DEPARTMENT  PROVIDE  A  COMPUTERIZED  TESTING   SERVICE? 

Background 

If    the  Department  were  to  develop  a  pool  of    items,    it  could   put   them   in 
a  computer   bank  rather  than  releasing   them   in  hard-copy  format   to  all  dis- 
tricts.     What   the  Department  would   send   to  districts  would   be  a  catalog   of 
objectives,    or  descriptors  of    types  of    items.      Each  district  would  decide 
what   it  wanted   its  test   to   look  like,    i.e.,    the  number   of    items   it  wanted 
drawn  from  each  objective.      The  district   then  would   submit   this   "shopping 
list"  to   the  Department,   which  in  turn  would  draw  the   items  as  requested  from 
its  bank,    print   one  or  multiple  copies  of    the  test,    and   send   the  test   to   the 
district.      Within  each  objective,    items  would   be  drawn  at  random.      For   example, 
if a district wanted its test to contain two items from objective 1.3.15.6.A,
and   the  Department's  pool  had   10   items  measuring   that  objective,    the  test   the 
district  requested  might  contain   items   #5  and  #7.      If    the  district  requested 
the   same  types  of    items  a   second   time,    it  might  get   items   #3   and   #8.      By  making 
repeated  requests  for   the  same  types  of    items,    a  district  would   have  available 
several  parallel  tests.      Parallel  tests  are  tests  which  measure  the  same  content, 
and  are  of   about  the  same  difficulty,    but  contain  different   items. 
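
The random draw described above might look something like the following sketch. It assumes the bank is held as a mapping from objective codes to lists of item numbers and that a district's "shopping list" maps objective codes to the number of items wanted; both layouts, and the function name, are hypothetical.

```python
import random


def draw_test(item_bank, shopping_list, rng=None):
    """Draw the requested number of items at random within each objective.

    item_bank:     dict of objective code -> list of item numbers in the bank
    shopping_list: dict of objective code -> number of items requested
    rng:           optional random.Random instance, useful for repeatable draws
    """
    if rng is None:
        rng = random.Random()
    test = {}
    for objective, count in shopping_list.items():
        available = item_bank.get(objective, [])
        if count > len(available):
            raise ValueError(f"bank holds too few items for objective {objective}")
        # Sampling without replacement, so no item appears twice on one test.
        test[objective] = sorted(rng.sample(available, count))
    return test


# Ten items measure the objective; each request draws two of them at random.
bank = {"1.3.15.6.A": list(range(1, 11))}
first_form = draw_test(bank, {"1.3.15.6.A": 2})
second_form = draw_test(bank, {"1.3.15.6.A": 2})
```

Repeated requests with the same shopping list yield forms covering the same objectives with (usually) different items. Whether such forms are truly parallel in difficulty would depend on the items within each objective being of similar difficulty, which this sketch does not check.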

While  a   system  of    this   type  could  be  operated   by  hand,    it  would   be  far 
easier   to  do   by  computer.      Several   such   systems  are   in  operation   in  the  country 
today,   with  at   least   two  of    them  operating   commercially. 


Advantages of Computerized Tests

1. Flexibility. With one exception (to be discussed below), computerized
tests have all the flexibility of item pools. Just as with item pools, districts
could have their tests constructed so that they measured local objectives
with the proper weighting, contained items of the proper difficulty, and were
matched to the intended usage of the test.


2.  Security.      Computerized    tests  would   have  all   the   security 
advantages   of    item  pools,    with  the  additional   benefit    that   the  pool  would 
not   be  published    in  hard-copy.      Consequently,    the   items  would   not   all   be 
together    in  one  place   in  any  district  where  unauthorized  persons  might  have 
the  opportunity  to  review  them. 

3.  Information.      A  difficulty  with  most    systems   is   that   the  Department 
would   have   little  knowledge  about  whether   the   system  was   being  used,    how  it 
was  being  used,    and  how  to   improve   it.      Such  information  often   is  costly  and 
sometimes   impossible  to  obtain.      In  a  computerized   system,    the  Department 
would   know  how  many  districts  were  requesting   tests,   which  objectives  were 
being   chosen  most  often,    and  whether  usage  was   increasing   or  declining   over 
time.      If   a  fixed   test  or   an   item  pool  were  published,    the  Department  could 
obtain   information  of   that  nature  only  from  surveys. 
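
The kind of usage information mentioned under item 3 could be tallied from a log of test requests, as in this sketch. The log format (district name, year, shopping list) and the function name are assumptions made for illustration only.

```python
from collections import Counter


def summarize_requests(request_log):
    """Summarize a log of test requests received by a central service.

    request_log: list of (district, year, shopping_list) tuples, where
                 shopping_list maps objective codes to items requested.
    """
    districts = set()
    objective_counts = Counter()
    requests_per_year = Counter()
    for district, year, shopping_list in request_log:
        districts.add(district)
        requests_per_year[year] += 1
        # Count each objective once per request that includes it.
        objective_counts.update(shopping_list.keys())
    return {
        "district_count": len(districts),
        "most_requested": objective_counts.most_common(),
        "requests_per_year": dict(sorted(requests_per_year.items())),
    }
```

Comparing the yearly counts would show whether usage was increasing or declining, and the objective tallies would show which objectives districts chose most often.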

Disadvantages   of    Computerized   Tests 

1. Cost. Costs of such a system are uncertain at this time, but
computerized   systems  can  be  expensive.      If    school  districts  have  small 
computers  with  some   idle  time,    the  costs  might  not  be  too  great   to   supply 
them  with  tapes  or  disks  of    the  test  generating   computer  program  and   of   the 
item  bank.      However,    if    the  Department   has  to  establish  its  own  computer   to 
operate   such  a   system,    the  cost  might  be  prohibitive.      The   issues  of   these 
costs  will  be  more  carefully  explored  during   the  feasibility  study. 

2.  Time.      It    is  not   known  yet  whether   local  computers  would   be  used   to 
operate  this   system,    or   if    it  would  be  centralized.      If   centralized,    turn- 
around  time   (the  time   it   takes   to  receive  a  response  to   a  request)   might   be 
so   long   as  to  make  the   system   inoperative.      A  great  deal  would  depend   on  how 
far    in  advance  of   the  testing  date  districts  would  request   their  tests.      With 
proper  planning,    time   should  not   be  a  problem;    but  districts  would  definitely 
have  to  plan   in  advance. 

3.  Individual   item  quality  must   be  high.      One  advantage  of   the  item 
pool  concept  was   that   some  items   of   questionable  generalizability  could   be 
included    in  the  pool  without  destroying   the  credence  of   the  entire  pool. 
However,    in  a  computerized    system,    the  items  must  all  be  acceptable  to  all 
people  since  the  district   is  being  delivered   a  fixed   set  of    items.      This 
means  that   the  initial  costs  of  constructing   the  pool — especially  the  initial 
review  of    items   that   enter   the  pool — will  be  higher  for  a  computerized   system 
than  for  an   item  pool. 

However,    there  are  two  factors   that  would  minimize  these  problems.      First, 
if   a  district  didn't   like  some  particular    items,    they  could  request  replacement 
items.      While  this  would  place  extra  time  and   cost   burdens   on  the  system,    the 
price would not be too great if it happened infrequently.  Second, the
Department could monitor complaints about its items.  This information would permit
the improvement of the bank over time, and also enable the Department to defend
the pool ("This item has been sent to 150 districts over the last four years,
and you're the first one to tell us that it wasn't any good.").




4.   Administrative  complexity.   This  system  would  require  a  far  more 
complex  administrative  structure  than  any  of  the  others.   Objectives 
manuals  and  order  forms  would  be  printed  and  mailed  to  all  districts.   Order 
forms  would  be  returned  to  a  central  location.   Tests  would  be  printed  and 
mailed  back.   Replacement  items  might  be  requested  and  mailed  back.   Billings 
for  test  production  would  have  to  be  mailed  and  payment  received.   With  that 
amount  of  paperwork,  there  is  a  great  potential  for  error.   The
administrative requirements  of  this  system  far  exceed  those  of  other  systems  in  which  a  test
(or  a  pool)  simply  is  printed  and  mailed.   Above,  it  was  noted  that  a  major
advantage  to  the  computerized  system  was  control  and  information;  the
Department would  know  a  great  deal  about  how  the  system  was  being  used.   This
knowledge  does  not  come  without  cost:  the  cost  of  managing  the  system.

ISSUE  #5 
SHOULD  THE  DEPARTMENT  FURNISH  ADDITIONAL  ITEMS  ON  REQUEST? 

Background 

One  of  the  disadvantages  of  the  Department  publishing  a  fixed  test  (long
or  short)  is  its  lack  of  flexibility.   If  a  district  does  not  like  some  of  the
questions,  or  if  it  thinks  there  are  other  areas  that  should  be  tested,  it
either  can  accept  the  test  as  it  is,  with  its  perceived  faults,  or  it  can
write  new  items  for  the  test.   If  districts  were  to  choose  this  latter  course,
they  would  be  creating  their  own  test  and  would  be  required  to  conduct
appropriate analyses  on  it.   One  way  to  avoid  this  requirement  would  be  to  have
districts write  the  Department  detailing  their  complaints.   The  Department  could  then
rewrite  the  old  questions,  or  construct  new  items  in  areas  not  covered  by  the
original  test.

Advantages  of  Having  the  Department  Write  Additional  Items  on  Request

1.  Flexibility.   This  is  one  more  way  of  customizing  tests;  a  way  of 
permitting  the  Department  to  be  responsive  to  local  needs.   As  with  the  item 
pools  and  customized  tests,  this  system  provides  a  means  of  insuring  that 
local  districts  will  feel  that  the  test  they  are  using  is  a  good  one  (that  it
is  well  written  and  properly  reflects  local  needs)  without  requiring  the  local
district  to  hire  and  train  its  own  item  writers. 

2.  Information.  As  with  the  computerized  system,  this  system  provides 
for  the  Department  to  receive  feedback  from  local  districts.   Since  local 
staff  would  be  creating  the  demand  for  new  items,  this  system,  more  than  any 
other,  would  inform  the  Department  of  new  trends. 

Disadvantages  of  Having  the  Department  Write  Additional  Items  on  Request 

1.   Cost.   Depending  on  demand,  the  costs  involved  in  this  system  could 
be  overwhelming.   If  250  districts  each  made  requests  for  10  different  sets  of 
new  items,  the  costs  for  producing  those  items  could  exceed  a  million  dollars. 
Provided  that  districts  make  similar  requests,  however,  costs  would  be  far
less.   For  similar  requests,  only  one  set  of  items  would  need  to  be  written
and  sent  to  all  requesting  districts.   The  crux  of  this  issue  is  the  commonality
of  requests:  if  many  districts  request  items  in  the  same  area,  then  it  would
seem  that  basic  skills  objectives  should  have  been  written  in  that  area
from  the  beginning.   If  districts  don't  request  items  for  the  same  objectives,
then  cost  to  the  Department  would  be  excessive.
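
The arithmetic behind the cost estimate can be sketched as follows. The district count, the number of sets, and the million-dollar threshold come from the text; the per-set cost is not stated in the text and is only what those figures imply, and the "50 distinct sets" scenario is a hypothetical illustration of overlapping requests.

```python
DISTRICTS = 250          # from the text
SETS_PER_DISTRICT = 10   # from the text
COST_THRESHOLD = 1_000_000  # "a million dollars", from the text

# If every request is different, every set must be written separately.
total_sets = DISTRICTS * SETS_PER_DISTRICT

# Per-set cost implied by the text's figures (an inference, not a quoted price).
implied_cost_per_set = COST_THRESHOLD / total_sets


def production_cost(distinct_sets, cost_per_set):
    """Cost of writing item sets when identical requests share one set."""
    return distinct_sets * cost_per_set


print(total_sets, implied_cost_per_set)
# Hypothetical: the same requests collapse into only 50 distinct sets.
print(production_cost(50, implied_cost_per_set))
```

The cost is driven by the number of distinct sets rather than the number of requests, which is why the commonality of requests is the crux of the issue.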

2.  Time.   The  Department  would  have  access  to  just  so  many  qualified 
item  writers.   If  the  demand  for  new  items  were  high,  the  turn-around  time 
for  requests  might  run  several  months.   Once  again,  while  the  potential 
demand  is  known  to  be  high,  what  the  actual  demand  would  be  is  purely 
speculative.   However,  the  demand  would  not  have  to  be  very  great  before  the 
Department  would  find  itself  completely  overwhelmed. 

3.  Communication.   An  assumption  of  this  system  is  that  a  local
district would  be  able  to  communicate  its  request  to  the  Department  clearly
enough  so  that  the  Department  could  respond.   The  local  district  would  have
to  write  the  objectives  that  the  Department  would  use  to  construct  the  items.
If  the  objectives  were  vague  or  poorly  written,  it  is  likely  that  the  items
the  Department  would  come  up  with  would  be  unacceptable  to  the  local  district.
Consequently,  the  local  district  would  have  to  write  clear,  highly  specific
objectives  (or  item  specifications).   But  that  is  the  most  difficult  and
time-consuming  part  of  item  writing.   In  other  words,  by  the  time  the  local
district  staff  could  communicate  to  the  Department  what  it  wanted,  it  probably
could  have  written  the  items  itself.

ISSUE  #6 
SHOULD  THE  DEPARTMENT  PROVIDE  NORMS  WITH  ITS  TESTS? 

Background 

As  part  of  the  feasibility  study,  we  are  going  to  examine  the  types  of 
statistics  that  could  or  should  be  provided  along  with  the  tests  to  local 
districts.   One  type  of  statistic  to  be  examined  is  norms.  Norms  tell 
someone  what  percentage  of  people  in  the  norming  group  (e.g.,  all  first- 
semester  ninth  grade  students  in  public  schools  in  Massachusetts)  fall  below 
a  certain  point  on  the  score  scale.   As  with  all  statistics,  norms  can  be 
used  well  or  poorly,  and  consequently,  there  will  be  no  discussion  of 
"advantages"  and  "disadvantages."  The  costs  of  obtaining  norms  are  not  great, 
providing  that  they  don't  have  to  be  highly  precise.   In  connection  with  the 
Basic  Skills  Improvement  Policy,  norms  could  be  properly  used  to  help  districts 
establish  standards.   For  example,  it  would  be  useful  to  know  how  hard  a  test
item  would  be  for  a  tenth-percentile  student  to  answer  correctly.   On  the  other  hand,
norms  can  be  used  quite  improperly.   One  example  would  be  setting  a  standard 
of  the  50th  percentile  ("I  want  all  of  our  kids  to  be  able  to  score  above 
average!"),  and  then  each  year  failing  half  the  class. 
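
A minimal sketch of what a norm reports, and of why a 50th-percentile standard fails half the group, is given below. The scores are invented for illustration and the function name `percent_below` is hypothetical; "below" is taken strictly, which is one common convention and an assumption here.

```python
def percent_below(norm_scores, score):
    """Percentage of the norming group scoring strictly below `score`."""
    if not norm_scores:
        raise ValueError("norming group is empty")
    below = sum(1 for s in norm_scores if s < score)
    return 100.0 * below / len(norm_scores)


# Hypothetical raw scores for a ten-student norming group.
NORM_GROUP = [12, 15, 18, 20, 22, 25, 27, 30, 33, 36]

# A passing standard placed at the score that half the group falls below
# necessarily fails that half of the same group.
cutoff = 25
print(percent_below(NORM_GROUP, cutoff))
```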

ISSUE  #7 
SHOULD  THE  DEPARTMENT  ASSESS  COMPETENCE  IN  WRITING  VIA  A  HOLISTICALLY-SCORED
WRITING  SAMPLE,  AN  ANALYTICALLY  SCORED  WRITING  SAMPLE,  OR  BOTH?

Background 

The  Massachusetts  Department  of  Education  is  currently  providing  districts, 
upon  request,  with  multiple  forms  of  a  writing  test.   Each  form  requires  the 
student  to  compose  two  writing  samples  on  two  well-defined  topics  with  the  aid
of  a  dictionary.   Last  year's  test,  for  instance,  required  writing  a  letter
of  complaint  and  an  essay  about  someone  vivid  in  the  student's  memory.
Writing  samples  were  scored  holistically,  that  is,  each  writing  sample  was
rated   on  a   scale  of    1   to  4   by   each  of   two  readers.      Ratings  from   the   two 
readers  were  summed   to   obtain  a  final   score.      In  keeping   with  the  principles 
of   holistic   scoring,    readers  were   instructed   to  rate  the  writing    samples 
on  overall  quality,    or   the  overall   impression  they  made,    rather   than  analyze 
them  for   specific   strengths  and  weaknesses. 
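
The scoring rule just described can be written out as a short sketch. The function name `holistic_score` is hypothetical; the 1-to-4 scale and the summing of two readers' ratings come from the text. How the scores on a student's two writing samples are combined is not stated, so the sketch scores one sample only.

```python
def holistic_score(rating_a, rating_b):
    """Final score for one writing sample: the sum of two readers'
    overall-impression ratings, each on a scale of 1 to 4."""
    for rating in (rating_a, rating_b):
        if rating not in (1, 2, 3, 4):
            raise ValueError(f"rating must be 1, 2, 3, or 4, got {rating!r}")
    return rating_a + rating_b


# Final scores therefore run from 2 (both readers give 1) to 8 (both give 4).
print(holistic_score(3, 4))
```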

There  are  at   least   two  alternatives   to  the  current   system.      (1)    Scoring 
could   be  done  analytically,    by  awarding   points  on  each  of    the  basic   skills 
writing  objectives.   Minimum  standards  could  be  established  for  each
objective,  such  as  the  number  of  misspelled  words  or  number  of  punctuation  errors
allowable.      (2)    The  features   of   holistic  and   analytic   scoring   could   be  combined 
in  a  way  that  permits  readers   to  respond   holistically  to  a  few  broad   analytic 
categories.      Rating    systems  of    this   sort  are   sometimes  called   primary  trait 
systems.      An  example  of   such  a   system   is  Diederich's  Scale   in  the  table  below. 


TABLE 6.1   Diederich's Scale for Grading English Composition

1 = Poor    2 = Weak    3 = Average    4 = Good    5 = Excellent

Quality and development of ideas       1  2  3  4  5
Organization, relevance, movement      1  2  3  4  5
                                          Subtotal ____ x 5 = ____

Style, flavor, individuality           1  2  3  4  5
Wording and phrasing                   1  2  3  4  5
                                          Subtotal ____ x 3 = ____

Grammar, sentence structure            1  2  3  4  5
Punctuation                            1  2  3  4  5
Spelling                               1  2  3  4  5
Manuscript form, legibility            1  2  3  4  5
                                          Subtotal ____ x 1 = ____

                                          Total grade ____

Source:  Adapted  from  A.  Jewett  and  C.  E.  Bish,  eds.,  Improving  English
Composition.  Washington:  National  Education  Association,  1965.  Reprinted  by
permission of  the  National  Education  Association.
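
The weighted total on Diederich's Scale can be computed as in the sketch below. The grouping is an assumption read from the order of the multipliers in the table: the first two categories are weighted by 5, the next two by 3, and the last four by 1. The short category keys and the function name `diederich_total` are hypothetical.

```python
# Assumed weights: ideas and organization x 5, style and wording x 3,
# the four mechanics categories x 1.
WEIGHTS = {
    "ideas": 5,
    "organization": 5,
    "style": 3,
    "wording": 3,
    "grammar": 1,
    "punctuation": 1,
    "spelling": 1,
    "manuscript": 1,
}


def diederich_total(ratings):
    """Weighted total grade from eight category ratings, each 1 to 5."""
    if set(ratings) != set(WEIGHTS):
        raise ValueError("ratings must cover exactly the eight categories")
    for name, value in ratings.items():
        if value not in (1, 2, 3, 4, 5):
            raise ValueError(f"{name} rating must be 1 to 5, got {value!r}")
    return sum(WEIGHTS[name] * ratings[name] for name in WEIGHTS)


average_paper = {name: 3 for name in WEIGHTS}  # rated "Average" throughout
print(diederich_total(average_paper))
```

Under these assumed weights the total ranges from 20 to 100, and ideas and organization together account for half of the maximum.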


Advantages  of  a  Holistically  Scored  Writing  Sample 

The  major  advantage  of  the  current  system  is  the  ease  with  which  writing 
samples  can  be  scored  via  the  holistic  method.   By  ignoring  specific  basic 
skills  writing  objectives  and  scoring  samples  as  an  integrated  whole,  reasonably 
reliable  estimates  of  students'  relative  writing  abilities  can  be  obtained  for 
a  large  group  of  students  in  a  short  period  of  time.   With  a  testing  program 
as  large  as  that  involved  in  basic  skills  assessment,  ease  of  scoring  is  no 
small  matter  and  must  weigh  heavily  against  the  advantages  of  other  approaches. 

Disadvantages  of  a  Holistically  Scored  Writing  Sample 

The  ease  of  holistic  scoring  is  offset  by  the  limited  information  it 
provides.   The  total  score  permits  ordinal  scaling  of  students  according  to 
ability,  but  provides  no  clue  as  to  which  specific  writing  skills  were  weak  or 
strong.   In  other  words,  we  know  a  student's  position  relative  to  some  of
his  or  her  peers,  but  we  don't  know  why  he  or  she  is  in  that  position.   Even  a
post   hoc   analysis   of   writing    samples  with  different    scores   cannot   reveal 
which  particular  aspects  of    the  samples  were   instrumental   in  determining 
the  reader's  final   evaluation.     Without   such  information,    it    is   impossible 
to  determine  whether   low-scoring    students  were  deficient   in  spelling, 
grammar,   punctuation,    handwriting,   choice  of  words,    organization  of    ideas, 
or  higher  order   thought  processes.      Any  one  or   any  combination  of   these 
factors  may  have   influenced   the  reader. 

Advantages  of  Alternative  Scoring  Systems

The  more  analytic  a  scoring  system  is,  the  more  information  it  provides
about   the  relative  strengths  and  weaknesses  of    individuals  and   the  relative 
strengths  and  weaknesses   of    the  writing  curriculum.      An  analytic   system, 
however,    can  be  advocated   only  to   the  extent   that   local  districts  need  or 
want   that   additional    information. 

Disadvantages  of  Alternative  Scoring   Systems 

Analytic   scoring   systems  are  more  time-consuming   than  a   strict   holistic 
system  and  require  more  training    in  order   to  yield   reliable  results.      It    is 
also  possible  that  the  readers  will  be  so  intent  upon  picking  apart  individual
features  of  a  writing  sample  that  they  will  miss  seeing  the  overall
effectiveness of  the  writer.


APPENDIX  C 


Workshop  Questionnaire 

After  introducing  the  item  pool  concept  and 
completing  several  exercises  in  the  use  of  item 
pool  statistics,  the  following  questionnaire  was 
distributed  to  all  workshop  participants.




Your  Position  in  the  School  District 


1.   Is  your  district  planning  to  use  one  or  more  of  the  state's  Basic  Skills 
Tests  during  1980-81? 

Yes     No 


2.   The  Department  will  be  publishing  a  test  from  an  item  pool  for  use  by 
local  districts.   The  Department,  in  addition,  could  provide  that 
item  pool  to  local  districts  and  allow  them  to  develop  their  own 
equivalent  form  of  the  state  test.   Would  you  favor  the  distribution 
of  such  an  item  pool?

Yes     No 


3.   Do  you  think  that  your  district  would  use  the  item  pool  if  it  were 
released? 

Yes     No 

If  you  answered  "yes"  to  the  previous  question,  which  of  the  following 
two  ways  would  your  district  likely  use  the  pool? 

a.  To  replace  items  in  the  state  test. 

b.  To  develop  a  test  of  different  length  from  the  state  test. 


4.   If  your  district  were  provided  with  all  necessary  information,  do 
you  believe  your  district  would  be  able  to  make  effective  use  of: 


DCVs?          Yes     No

P-values?      Yes     No

R-values?      Yes     No

5.   Which  type(s)  of  item  statistics  do  you  think  your  district  would  like 
to  receive  if  an  item  pool  were  distributed  by  the  Department? 

DCV     P-values     R-values     None  of  these 


6.   Which  type(s)  of  item  statistics  do  you  think  your  district  would
object  to  receiving?

DCVs     P-values     R-values     None  of  these 


APPENDIX  D 

Results  of  the  Workshop  Questionnaire 

The  following  tables  summarize  data  from  the 
questionnaire  in  Appendix  C.   Results  are  broken 
down  for  each  workshop  location,  with  statewide  totals 
in  the  extreme  right-hand  column. 




Results of Statewide Basic Skills Workshop Questionnaire

March, 1980

Part I

                               Number of Responses by Location
Question                                                           Total    %

Total # of respondents           15   20   17   48   25   32       157

1. Using State's             Y   15   19   13   39   22   27       135   86
   Basic Skills Tests?       N    0    0    3    6    2    3        14    9

2. Favor release             Y   13   20   15   47   24   31       150   96
   of item pool?             N    2    0    2    1    1    1         7    4

3. Use pool?                 Y   11   18   15   45   23   30       142   90
                             N    3    1    2    3    2    2        13    8

3a. To replace items             12   17   13   37   24   24       127   81

3b. To develop new test           5    8    6   20   13   12        64   41

4. Make effective use of:

     DCVs                    Y   10   17   13   41   25   27       133   85
                             N    4    1    3    1    0    0         8    5

     P-values                Y   11   17   12   42   23   28       133   85
                             N    3    1    3    1    2    2        12    8

     R-values                Y   11   16   10   25   14   24       100   64
                             N    3    4    5   11    9    7        39   25

5. Like to receive:

     DCV                          4    7    7   22   15   15        70   45
     P-value                      4    7    8   31   15   19        84   54
     R-value                     10   16    9   20   10   19        84   54

6. Object to receiving:

     DCV                          2    2    4    4    0    0        12    8
     P-value                      1    1    3    3    3    2        13    8
     R-value                      0    1    7   11    9    6        34   22
     None of these               12   16    7   30   14   24       103   66



Results of Statewide Basic Skills Workshop Questionnaire

March, 1980

Part II

                                            Number of Responses by Location
Question                  Rating                                               Total    %

Replace items in test     Very Undesirable      3    3    0    3    0    2        11    7
                          Undesirable           0    0    0    1    1    2         4    3
                          Desirable             7    4    6   12    8    8        45   29
                          Very Desirable        5   13   11   31   15   21        96   61

Change length of test     Very Undesirable      4    4    0    2    0    3        13    8
                          Undesirable           0    2    2    1    2    3        10    6
                          Desirable             7    6    8   17   10   14        62   39
                          Very Desirable        4    8    7   26   10   13        68   43

Compare difficulty        Very Undesirable      2    1    2    1    0    2         8    5
                          Undesirable           0    0    0    2    1    0         3    2
                          Desirable             6    5   10   14    9   14        58   37
                          Very Desirable        7   14    5   28   14   17        85   54

Compare average score     Very Undesirable      3    0    3    7    3    4        20   13
to statewide average      Undesirable           3    0    5    3    2    2        15   10
                          Desirable             4    9    7   16    9    9        54   34
                          Very Desirable        5   11    2   21   10   18        67   43

What percent statewide    Very Undesirable      1    1    2   11    4    1        20   13
would pass your standard  Undesirable           5    2    7    8    6    3        31   20
                          Desirable             6   11    5   16    8   16        62   39
                          Very Desirable        3    6    3   11    6   13        42   27

Make standard equivalent  Very Undesirable      2    0    1    3    1    1         8    5
                          Undesirable           1    2    0    5    6    3        17   11
                          Desirable             6    4    7   16    9   13        55   35
                          Very Desirable        6   12    8   20    8   16        70   45



Results of Statewide Basic Skills Workshop Questionnaire

March, 1980

Part II (Continued)

                                            Number of Responses by Location
Question                  Rating                                               Total    %

V.  Relative strengths    Very Undesirable      2    2    0    4    0    2        10    6
    and weaknesses        Undesirable           1    2    1    1    4    2        11    7
                          Desirable             2    4    6   13    9   13        47   30
                          Very Desirable       10   12   10   27   11   17        87   55

VI. Changes in strengths  Very Undesirable      2    2    0    4    0    2        10    6
    and weaknesses        Undesirable           1    2    1    3    6    4        17   11
                          Desirable             5    4    9   19   11   13        61   39
                          Very Desirable        7   12    6   18    7   14        64   41

Overall Rating            Excellent             7   10    1   27    8   14        67   43
of Workshop               Quite Good            5    9   11   14   15   17        71   45
                          Fair                  2    1    5    6    3    0        16   10
                          Inadequate            1    0    0    0    0    0         1    1
                          Poor                  0    0    0    0    0    0         0    0



                                           Average Ranking   % giving a      % giving an
Question                                   (Scale of 1-4)    "desirable"     "undesirable"
                                                             rating          rating

Test Construction

  I.   Replace Items                            3.45             90              10
  II.  Change Length                            3.21             82              14

Test Interpretation

  I.   Compare Difficulty                       3.43             91               7
  II.  Compare to State                         3.08             77              23
  III. Compare State to Your Standard           2.81             66              33
  IV.  Make Standard Equivalent                 3.25             80              16
  V.   Strengths and Weaknesses                 3.36             85              13
  VI.  Changes in Strengths and Weaknesses      3.18             80              17

Overall rating of workshop
(scale of 1-5)                                  4.32
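
The average rankings appear to be weighted means of the response counts reported in Part II. The sketch below rests on an inferred scoring rule that the report does not state: ratings are scored 1 (Very Undesirable) to 4 (Very Desirable), the workshop rating is scored 1 (Poor) to 5 (Excellent), and participants who gave no rating are left out of the denominator. The function name `average_rating` is hypothetical.

```python
def average_rating(counts):
    """Weighted mean of a rating scale, where counts[0] is the number of
    responses at scale point 1, counts[1] at scale point 2, and so on."""
    total = sum(counts)
    if total == 0:
        raise ValueError("no responses")
    weighted = sum(point * n for point, n in enumerate(counts, start=1))
    return weighted / total


# Statewide totals from Part II, ordered from lowest to highest rating.
replace_items = [11, 4, 45, 96]       # Very Undesirable ... Very Desirable
workshop_rating = [0, 1, 16, 71, 67]  # Poor ... Excellent

print(round(average_rating(replace_items), 2))
print(round(average_rating(workshop_rating), 2))
```

Under this rule the two results agree with the printed values of 3.45 for Replace Items and 4.32 for the overall workshop rating.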



